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I.  Introduction 

Computer-aided  diagnosis  (CAD)  systems  for  mammography,  under  development  for  more  than  10  years, 
are  an  approach  for  low-cost  double-reading  with  potential  to  improve  the  detection  of  breast  cancer. 
Though  results  to  date  have  been  promising,  current  systems  often  suffer  from  unacceptably  high  false 
positive  rates  and  lower  than  expected  sensitivity  and  specificity  when  evaluated  on  new  data.  Improved 
methods  are  needed  for  optimally  setting  the  system  parameters,  particularly  in  the  case  of  statistical 
models  and  neural  networks  which  are  common  elements  of  most  CAD  systems.  This  research  project 
looks  to  apply  principles  from  information  theory  to  build  improved  statistical  models  for  CAD  systems. 

Specifically,  we  develop  a  framework  for  building  hierarchical  pattern  recognizers  based  on  information 
theoretic  criteria.  The  best-known  example  of  such  criteria  is  the  minimum  description  length  principal 
{MDL)  pioneered  by  Rissanen  [9].  Using  these  criteria  we  have  developed  a  framework  for  building 
generative  hierarchical  image  probability  (HIP)  models.  Since  the  HIP  framework  is  a  generative  model, 
i.e.,  it  directly  models  the  probability  of  the  image  given  the  image  class;  it  is  well-suited  to  compression 
and  thus  application  of  MDL.  Along  with  conventional  MDL  we  have  evaluated  predictive  MDL  {pMDL) 
and  Akaiki’s  information  Criterion  (A/C).  Although  these  are  used  to  select  the  complexity  of  the  model, 
we  have  also  applied  information  theoretic  criteria  to  select  the  wavelet  basis  on  which  these  models  are 
built.  We  applied  these  techniques  to  the  problems  of  microcalcification  and  mass  detection.  We  evaluated 
these  techniques  using  mammographic  mass  and  microcalcification  datasets  from  The  University  of 
Chicago  (UofC)  and  in  all  cases  performance  has  been  evaluated  relative  to  the  UofC  CAD  system  [24], 
i.e.,  the  HIP  model  augments  the  UofC  CAD  system. 

We  also  rigorously  evaluated  the  generative  properties  of  the  model  for  image  synthesis  and  novelty 
detection.  Analysis  of  the  HIP  model  for  synthesizing  new  mammographic  images  is  important  for 
understanding  how  the  model  captures  image  structure  specific  to  mammographic  masses.  Novelty 
detection  is  particularly  relevant  since  it  would  enables  our  system  to  establish  confidence  measures  on 
detection,  something  which  most  current  CAD  systems  do  not  offer.  Finally  we  briefly  tested  the  models 
that  we  trained  for  classification  on  the  rather  different  problem  of  compression,  to  demonstrate  the 
flexibility  of  this  approach. 

II.  Body 

The  following  are  the  three  primary  tasks  under  the  first  year  of  the  project. 

1 .  Apply  and  evaluate  the  utility  of  our  hierarchical  pyramid  neural  network  (HPNN)  architecture  for 
improving  mass  detection  in  a  CAD  system. 


*  With  help  from  Paul  Sajda,  Department  of  Biomedical  Engineering,  Columbia  University,  New  York  NY, 
10027. 
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2.  Develop  basic  MDL  framework  within  context  of  building  models  for  CAD  applications-development 
of  HIP  framework. 

3.  Apply  MDL  framework  to  select  optimal  number  of  nodes  (labels)  for  statistical  (HIP)  models. 

The  following  are  the  two  primary  tasks  completed  under  the  second  year  of  this  project: 

1.  Further  develop  and  evaluate  the  hierarchical  image  probability  model,  specifically  focusing  on  the 
generative  aspects  of  its  architecture. 

2.  Apply  and  evaluate  MDL  framework  for  selecting  architecture  of  hierarchical  model.  Compare  MDL 
framework  with  other  model  selection  methods. 

The  primary  tasks  of  year  three  were  as  follows: 

1.  Apply  MDL  selection  framework  to  select  wavelet  packet  bases  from  wavelet  libraries,  and  train  HIP 
models  with  the  resulting  representation. 

2.  Examine  ROIs  and  correlate  ability  to  detect  certain  physical  structure  with  information  constraint 
learned  by  model. 

3.  Perform  ROC  analysis  and  qualitative  inspection  of  ROIs  to  determine  improvement  in  models’ 
performance.  Analyze  on  UofC  clinical  dataset  to  determine  if  generalization  performance  of  new  data 
has  improved. 

In  the  following  sections  we  describe  our  progress  in  accomplishing  these  tasks.  We  refer  to  our  year  1 
report  [25]  or  the  papers  included  in  the  Appendix  for  a  detailed  description  of  the  HIP  model. 

A.  MDL  and  AlC 

We  begin  with  a  discussion  of  information  theoretic  criteria  for  selecting  between  alternative  models. 

There  are  at  least  two  such  criteria:  the  Minimum  Description  Length  {MDL)  criterion  and  the  Akaike’s 
Information  Criterion  {AIQ.  We  have  investigated  the  usefulness  of  these  criteria  for  choosing 
Hierarchical  Image  Probability  {HIP)  models  for  classifying  mammographic  mass  and  microcalcification 
Regions  Of  Interest  {ROh).  A  typical  result  is  shown  in  Figure  1.  Both  MDL  and  AIC  track  test  Az 
performance — ^MDL  and  AIC  cost  decrease  as  Az  performance  on  the  test  set  increases.  In  the  following 
we  describe  the  two  criteria  and  then  suggest  a  methods  to  further  improve  the  information  theoretic 
selection  criteria. 

1.  MDL 

The  minimum  description  length  of  a  set  of  data  is  the  length  of  the  data  encoded  according  to  some 
probability  model,  which  is  the  model  we  are  trying  to  fit  to  the  data,  plus  the  length  of  the  description  of 
the  model  (Rissanen,  1983;  Rissanen,  1996).  The  length  of  the  encoding  of  the  data  is  the  negative  log 
probability  density  of  the  data  according  to  the  model,  plus  a  constant  representing  the  precision  with  which 
the  data  must  be  specified.  We  ignore  this  constant  when  doing  model  selection,  since  it  is  the  same  for  all 
models. 

The  code  length  of  the  model  has  two  components,  a  term  for  coding  the  architecture,  and  a  term  for 
encoding  the  parameters.  Suppose  we  are  comparing  models  with  different  structures.  For  example,  we 
may  be  comparing  mixture  density  models  with  different  numbers  of  mixture  components.  We  will  call  the 
different  models  architectures.  In  this  example,  the  number  of  mixture  components  needs  to  be  encoded, 
and  in  general  the  specific  architecture  must  be  encoded.  In  practice  this  is  often  ignored,  since  it  is  a  small 
contribution  to  the  total  description  length. 
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Figure  1,  Information  theoretic  model  selection  using  AIC  (red)  and  MDL  (blue).  Plotted  is  model 
cost  vs.  Az  on  the  test  data.  MDL  would  choose  a  model  with  test  A^  =  0.75  while  AIC  would  choose  a 
model  with  Az=0.78.  (The  uncertainty  in  these  estimates  is  probably  greater  than  the  difference.) 

Given  an  architecture,  we  need  to  encode  the  parameter.  The  Cramer-Rao  bound  gives  a  lower  limit  on  the 
variance  of  the  parameters  about  their  true  value,  assuming  that  the  true  probability  is  equal  to  our 
architecture  with  some  values  for  the  parameters.  This  limit  is  the  inverse  M'^  of  the  Fisher  information 
matrix  M, which  is  the  negative  expected  value  of  the  second  derivative  (or  Hessian)  with  respect  to  the 
parameters  of  the  log  probability  of  the  data  according  to  the  model,  evaluated  at  the  true  value  of  the 
parameters.  The  precision  with  which  we  encode  the  parameters  need  not  be  greater  than  the  precision  with 
which  we  know  diem,  i.e.,  it  need  not  be  greater  than  the  standard  deviations  given  by 

M'^.  Thus  we  would  compute  the  components  of  the  parameter  vector  along  the  eigenvectors  of  M'^,  and 
the  precision  of  these  components  are  given  by  the  square  roots  of  the  eigenvalues.  The  total  code  length 
of  the  parameters  is  the  sum  of  the  logarithms  of  these  precisions,  which  is  the  log  of  the  square  root  of  the 
determinant  of  M'^.  The  code  length  for  the  parameters  0  is  the  negative  log  of  the  square  root  of  this 
volume,  or 


1/2  1 

-log(|M-^|  )  =  -log(|M|). 


(1) 


Since  it  involves  the  probability  of  all  of  the  data,  M  is  proportional  to  the  number  of  examples  N,  at  least 
when  there  are  enough  examples.  Because  of  this  we  can  pull  out  the  dependence  on  N.  If  there  are  d 
parameters,  this  gives 
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=  -|log(iV)  +  ilog| 


M 
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(2) 


The  second  term  is  constant  in  the  limit  of  large  N,  so  in  that  limit  we  can  ignore  it.  The  remaining  term  is 
straighforward  to  compute,  since  we  only  need  to  know  the  number  of  parameters  and  the  number  of 
training  examples.  The  total  code  length  for  MDL  is  therefore, 


MDL  =  -2  log  P(Xi  1 0)  +  -  log  iV 
1=1  2 


(3) 


a.  Predictive  MDL  (PMDL) 

As  stated  earlier,  minimum  description  length  techniques  lend  themselves  well  to  HIP  models  because  a 
description  length  of  the  images  given  the  HIP  model  naturally  encodes  the  compactness  of  the  HIP 
distribution  along  with  the  likelihood  of  the  data  under  the  HIP  distribution.  MDL  therefore  gives  us  a 
natural  means  for  making  various  architecture  choices,  e.g.,  the  number  of  labels  at  each  level  in  the 
hierarchy,  the  types  of  features  to  use,  and  so  on.  In  our  first  experiments  we  chose  to  use  a  predictive 
MDL  or  PMDL  approach,  due  to  its  apparent  simplicity.  We  take  a  training  set  S,  say  all  of  the  mass  ROIs 
in  the  complete  training  set,  and  give  the  images  within  it  some  ordering.  We  then  train  the  HIP  model  on 
the  first  images  in  S  and  test  it  on  some  of  the  succeeding  images.  The  test  results  in  a  log-likelihood  for 
these  test  images,  which  we  then  use  to  initialize  a  running  sum  of  test  log-likelihoods.  We  then  re-train  on 
the  first  images  plus  the  images  on  which  we  already  tested,  and  test  on  more  of  the  images,  again  adding 
the  test  result  to  the  running  sum  of  log-likelihoods.  We  repeat  this  until  we  have  tested  on  the  last  images 
in  S. 

It  has  been  shown  [10]  that  the  result  is  asymptotically  equal  to  the  description  length  of  the  model  and  the 
data  under  the  final  trained  model.  Intuitively,  one  expects  a  U-shaped  curve  as  a  function  of  model 
complexity.  A  model  that  is  too  simple  will  give  relatively  poor  results  late  in  the  PMDL  procedure,  since  it 
can’t  adequately  fit  the  data.  A  model  that  is  too  complex  will  give  relatively  poor  results  early  in  the 
PMDL  procedure,  since  it  over  fits  the  data,  and  in  fact  overfits  it  for  more  iterations  than  a  simpler  model. 
Thus  models  that  are  either  too  simple  or  too  complex  will  have  relatively  high  values  for  the  accumulated 
test  error.  This  intuition  does  not,  of  course,  guarantee  that  the  optimal  model  according  to  PMDL  will 
generalize  optimally,  given  the  training  data. 

Compared  to  leave-one-out  cross-validation  PMDL  should  be  quite  fast  because  it  starts  each  re-training 
run  at  an  architecture  that  was  optimized  on  a  large  fraction  of  the  new  training  set.  Besides  the  speed 
advantage,  Rissanen  claims  that  PMDL  is  more  reliable  than  cross-validation  [10].  We  applied  PMDL  to 
choosing  the  number  of  components  or  labels  at  each  level  in  HIP  models  for  positive  and  negative  ROIs, 
and  we  give  those  results  in  the  experimental  section. 

Though  PMDL  seems  simple,  in  fact  for  models  of  probability  distributions  ordinary  MDL  is 
straightforward,  so  we  used  plain  MDL  later  in  this  project. 


2.  AlC 

Akaike's  Information  Criterion  is  the  expected  Kullback-Leibler  distance  between  the  true  model  and  the 
best  model  of  the  current  architecture,  given  the  data  set.  It  is  assumed  that  the  architectures  form  nested 
sets  with  the  true  distribution  being  a  member  of  one  of  these  sets,  that  the  number  of  examples  N  is 
sufficiently  large,  and  that  the  current  model  is  not  too  far  from  the  true  distribution.  The  resulting  criterion 
is 


AIC  =  -22logP(x,.  \e)+2d 

i=l 


(4) 
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3.  Deficiencies  of  the  information  criteria 

Both  MDL  and  AIC  assume  that  there  are  sufficent  examples,  N.  Treatments  of  MDL,  for  example, 
sometimes  use  the  term  "asymptotically",  which  implies  the  number  of  examples  goes  to  infinity  for  a  fixed 
model  (Rissanen,  1996).  Thus  we  should  only  expect  to  get  good  results  from  these  criteria  if  we  have 
enough  examples  and  we  find  a  best  model  before  trying  models  that  are  too  large  for  the  amount  of  data. 

In  our  current  experiments  we  are  not  obviously  in  this  situation.  We  have  a  fixed  number  of  examples, 
and  we  are  varying  the  model  complexity.  There  is  no  criterion  for  deciding  whether  we  have  enough 
examples,  or,  alternately,  when  we  have  too  complex  a  model  for  the  criteria  to  be  valid. 

One  possible  method  to  address  this  “asymptotic”  issue  is  to  add  corrections  to  the  criteria  For  MDL,  the 
correction  is  clear:  include  the  second  term  from  Equation  (2).  This  is  certainly  more  complex,  but  it  is 
feasible.  For  the  HIP  model  it  should  be  possible  to  estimate  the  Hessian  numerically. 


B.  HPNN 

Prior  to  this  project  [1 1]  we  had  developed  a  coarse-to-fine  hierarchical  pyramid/neural  network  (HPNK) 
architecture  that  combines  multi-scale  image  processing  techniques  with  neural  networks  to  detect 
microcalcifications  in  digital/digitized  mammograms  (see  Figure  2A).  To  search  an  image  we  apply  the 
network  at  a  position  and  use  its  output  as  an  estimate  of  the  probability  that  a  microcalcification  is  present. 
We  then  repeat  this  at  each  position  in  the  image.  In  the  coarse-to-fme  HPNN,  the  hidden  units  of  networks 
operating  at  low  resolution  or  coarse  scale  learn  associated  context  information,  since  the  targets 
themselves  are  difficult  to  detect  at  low  resolution.  The  context  is  then  passed  to  networks  searching  at 
higher  resolution.  The  use  of  context  can  significantly  improve  detection  performance  since 
microcalcifications  have  few  distinguishing  features.  In  the  HPNN,  each  of  the  networks  receives 
information  directly  from  only  a  small  part  of  several  feature  images  and  so  the  networks  can  be  relatively 
simple.  The  network  at  the  highest  resolution  integrates  the  contextual  information  learned  at  coarser 
resolutions  to  detect  the  object  of  interest. 

Under  this  project,  we  have  extended  the  HPNN  architecture  by  inverting  the  information  flow  in  the 
coarse-to-fme  architecture.  This  fme-to-coarse  HPNN  has  networks  extracting  detail  structure  at  fine 
resolutions  of  the  image  and  then  passing  this  detail  information  to  networks  operating  at  coarser  scales 
(see  Figure  2B).  This  is  useful  for  many  types  of  objects,  such  as  mammographic  masses,  for  which 
information  about  the  fine  structure  is  important  for  discriminating  between  different  classes.  Radiologists 
often  distinguish  malignant  from  benign  masses  based  on  the  detailed  shape  of  the  mass  border  and  the 
presence  of  spicules  alone  the  border.  Thus  the  fine-to-coarse  HPNN  should  be  well  suited  to  this  problem. 

At  each  level  of  the  fine-to-coarse  HPNN  several  hidden  units  process  the  feature  images.  The  outputs  of 
each  unit  at  all  of  the  positions  in  an  image  make  up  a  new  feature  image.  This  is  reduced  in  resolution  by 
the  usual  pyramid  blur-and-subsample  operation  to  make  an  input  feature  image  for  the  network  units  at  the 
next  lower  resolution.  We  trained  the  entire  fine-to-coarse  HP!^  as  one  network  instead  of  training  a 
network  for  each  level,  one  level  at  a  time. 

This  training  is  quite  straightforward.  Back-propagating  error  through  the  network  units  is  the  same  as  in 
conventional  networks.  We  must  also  back-propagate  through  the  pyramid  reduction  operation,  but  this  is 
linear  and  therefore  quite  simple.  In  addition  we  use  the  same  UOP  error  function  used  in  our  previous 
work  to  train  the  coarse-to-fine  architecture  [12].  The  rationale  for  this  application  of  the  UOP  error 
function  is  that  the  truth  data  specifies  the  location  of  the  center  of  the  mass  at  the  highest  resolution. 
However,  because  of  the  sub-sampling  the  center  cannot  be  unambiguously  assigned  to  a  particular  pixel  at 
low  resolution. 


5 


r 


P(t) 


Figure  2.  HPNN  architectures.  (A)  The  coarse-to-fine  HPNN  architecture  exploits  large-scale 
context  to  help  detect  small  objects  at  fine  scales.  (B)  The  fine-to-coarse  HPNN  integrates  fine-scale 
details  to  detect  extended  objects. 

The  features  input  to  the  fme-to-coarse  HPNN  are  filtered  versions  of  the  image,  with  filter  kernels  given  in 
polar  coordinates  by 


Vq,p(r,9)  = 


A/2 


n{q^\p\)\  j 


{r^)e 


ip  9 


(5) 


where 


is  an  associated  Laguerre  polynomial.  These  have  several  convenient  features.  First,  they  are 


complete,  so  any  image  structure  can  be  described  in  terms  of  them.  Second,  they  are  combinations  of 
derivatives  of  Gaussians,  and  can  be  written  as  combinations  of  separable  filter  kernels  (products  of  purely 
horizontal  and  vertical  filters),  so  they  can  be  computed  at  relatively  low  cost.  Third,  they  are  easy  to  steer, 
since  this  is  just  multiplication  by  a  complex  phase  factor.  We  steered  these  in  the  radial  and  tangential 
directions  relative  to  the  tentative  mass  centers,  and  used  the  real  and  imaginary  parts  and  their  squares  and 
products  as  features.  The  center  coordinates  of  the  tentative  masses  are  generated  by  the  earlier  stages  of 
the  CAD  system.  These  features  were  extracted  at  each  level  of  the  Gaussian  pyramid  representation  of  the 
mass  ROI,  and  used  as  inputs  only  to  the  network  units  at  the  same  level. 


The  fme-to-coarse  HPNN  is  quite  similar  to  the  convolution  network  proposed  by  Le  Cun,  et  al  [5], 
however  with  a  few  notable  differences.  The  fme-to-coarse  HPNN  receives  as  inputs  preset  features 
extracted  from  the  image  (in  this  case  radial  and  tangential  gradients)  at  each  resolution,  compared  to  the 
convolution  network,  whose  inputs  are  the  original  pixel  values  at  the  highest  resolution.  Secondly,  in  the 
fme-to-coarse  HPNN,  the  inputs  to  a  hidden  unit  at  a  particular  position  are  the  pixel  values  at  that  position 
in  each  of  the  feature  images,  one  pixel  value  per  feature  image.  Thus  the  HPNN’s  hidden  units  do  not 
learn  linear  filters,  except  as  linear  combinations  of  the  filters  used  to  form  the  features.  Finally  the  fme-to- 
coarse  HPNN  is  trained  using  the  UOP  error  function,  which  is  not  used  in  the  Le  Cun  network. 


1 .  HPNN  Mass  detection  results 

As  for  microcalcifications  [1 1],  we  apply  the  HPNN  as  a  post-processor,  but  here  it  processes  the  output  of 
the  mass-detection  component  of  UofC  CAD  system.  The  data  in  our  study  consists  of  72  positive  and  100 
negative  ROIs.  These  are  256-by-256  pixels  and  are  sampled  at  200-micron  resolution.  Half  the  data  was 
used  for  training  and  half  for  testing. 

Currently  our  best  performing  fine-to-coarse  HPNN  system  for  mass  detection  has  two  hidden  units  per 
pyramid  level.  This  gives  an  ROC  area  (A^)  of  0.85  and  eliminates  32  %  of  the  false-positives  without  any 
loss  in  sensitivity.  When  tested  on  a  third  set  of  ROIs  (36  positives  and  200  negatives),  the  HPNN  actually 
gave  better  performance,  with  Az  increasing  to  0.89  and  eliminating  51  %  of  the  false  positives  (Figure  3). 
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Figure  3.  ROC  curves  of  HPNN  mass  detector  for  validation  set  (part  of  the  set  from  which  training 
data  was  drawn)  and  a  separate  test  set.  (Data  was  provided  by  Prof.  Maryellen  Giger  of  the 
University  of  Chicago.) 

C.  HIP 

Though  our  results  for  the  application  of  the  HPNN  to  mammographic  CAD  have  been  promising,  the 
HPNN  is  not  a  good  framework  for  applying  MDL  techniques  for  model  selection.  Since  the  HPNN 
estimates  the  class  probability  there  is  one  bit  of  information  per  example  (i.e.  the  ROI  either  has  a  mass  or 
it  does  not).  From  the  MDL  point  of  view  we  are  attempting  to  reduce  the  number  of  bits  needed  to  specify 
the  classes  of  a  set  of  images  by  using  a  model  to  estimate  those  bits  from  the  images.  Since  we  need  bits 
to  specify  the  model  and  only  one  bit  per  image  to  specify  its  class  without  compression,  we  would  need 
very  many  images  to  save  more  bits  than  those  needed  to  encode  the  model.  Most  approaches  to  object 
recognition  in  images  also  estimate  P(Ciass  |  Image),  and  so  do  not  work  well  with  MDL  techniques. 

Alternatively,  if  we  estimate  P(Image  [  Class),  we  are  encoding  an  image,  so  we  potentially  save  very  many 
bits.  In  this  case  it  is  easy  to  save  more  bits  compressing  the  image  than  it  takes  to  specify  the  model, 
possibly  even  with  a  single  image.  This  is  important,  given  that  many  MDL  techniques  are  only  valid 
asymptotically  [10],  i.e.,  with  a  large  amount  of  data.  (A  single  image  contains  a  large  amount  of  data,  in 
some  sense,  but  the  model  structure  determines  whether  this  can  be  exploited.) 

A  model  of  the  probability  distribution  of  images  has  many  other  attractive  features.  We  could  use  this  for 
object  recognition  in  the  usual  way  by  training  a  distribution  for  each  object  class  and  using  Bayes’  rule  to 
get  P(Class  1  Image).  Since  we  would  have  P(Image),  we  could  apply  the  resulting  model  to  many  different 
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tasks  without  further  training  or  model  selection.  For  example,  we  can  detect  unusual  images  and  reject 
them  rather  than  trust  the  classifier;  something  that  is  not  possible  with  models  of  P(Class  |  Image),  We 
could  also  use  the  mode  to  compress  the  images.  Since  a  model  of  the  distribution  of  data  can  be  used  to 
generate  random  examples  that  are  supposedly  typical  of  the  data,  this  type  of  model  is  often  referred  to  as 
a  generative  model. 

Though  the  HPNN  is  not  ideally  suited  for  the  application  of  MDL,  it  does  have  some  attractive  features. 
Most  importantly,  the  HPNN  is  a  framework  for  learning  and  integrating  multi-resolution  information  for 
object  classification.  For  instance,  the  HPNN  is  able  to  improve  microcalcification  detection  performance 
for  the  University  of  Chicago  CAD  system  because  it  can  exploit  low  resolution  contextual  information, 
such  as  the  location  of  blood  vessels  and  the  ductal  system  [11].  Thus  a  generative  modeling  framework 
should  also  take  advantage  of  multi-resolution  information  for  exploiting  contextual  information. 

We  have  developed  a  model  of  image  distributions  with  these  properties,  that  we  call  the  Hierarchical 
Image  Probability  or  HIP  model.  In  the  following  section  we  briefly  describe  previous  work  in  modeling 
the  probability  distributions  of  images.  We  then  describe  the  new  framework  of  HIP  models  we  have 
developed.  We  present  the  theory  behind  the  framework  and  then  our  results  in  applying  the  HIP  model  to 
mammographic  mass  and  microcalcification  detection.  We  include  discussions  of  several  topics  mentioned 
above,  namely  novelty  detection,  synthesis  of  images  as  a  means  of  investigating  the  models,  and 
compression.  We  also  present  our  investigation  of  information  theoretic  techniques  for  choosing  wavelet 
packet  bases  on  which  to  build  HIP  models. 

1 .  Theory 

a.  Previous  work  in  modeling  image  probability  distributions 

Many  image  analysis  algorithms  use  probability  concepts,  but  few  treat  the  distribution  of  images.  Zhu,  Wu 
and  Mumford  [14]  do  this  by  computing  the  maximum  entropy  distribution  given  a  set  of  statistics  for  some 
features.  This  works  well  for  textures  but  it  is  not  clear  how  well  it  will  model  the  appearance  of  more 
structured  objects.  In  addition,  with  their  approach  it  is  easy  to  compute  the  probability  of  an  image  but 
harder  to  sample  from  the  distribution,  i.e.,  generate  new  artificial  images.  The  ability  to  sample  is 
necessary  for  many  image  analysis  applications,  e.g.,  compression. 

There  are  several  algorithms  for  modeling  the  distributions  of  features  extracted  from  the  image,  instead  of 
the  image  itself.  The  Markov  Random  Field  (MRF)  models  are  an  example  of  this  line  of  development;  see, 
e.g.,  [6, 4].  Unfortunately  they  tend  to  be  very  expensive  computationally.  Because  it  is  not  an  image 
distribution  it  only  applies  to  some  image  analysis  tasks,  such  as  texture  classification,  that  do  not  require 
sampling. 

In  De  Bonet  and  Viola’s  flexible  histogram  approach  [2,  1],  features  are  extracted  at  multiple  image  scales, 
and  the  resulting  feature  vectors  are  treated  as  a  set  of  independent  samples  drawn  from  a  distribution.  They 
then  model  this  distribution  of  feature  vectors  with  Parzen  windows.  The  flexible  histogram  approach  has 
given  good  results,  but  the  feature  vectors  from  neighboring  pixels  are  treated  as  independent  when  in  fact 
they  share  exactly  the  same  components  from  lower-resolutions.  To  fix  this  one  might  build  a  model  in 
which  the  features  at  one  pixel  of  one  pyramid  level  condition  the  features  at  each  of  several  child  pixels  at 
the  next  higher-resolution  pyramid  level. 

The  multi-scale  stochastic  process  (MSP)  methods  do  exactly  that.  Luettgen  and  Willsky  [8],  for  example, 
applied  a  scale-space  auto-regression  (AR)  model  to  texture  discrimination.  They  use  a  quadtree  or 
quadtree-like  organization  of  the  pixels  in  an  image  pyramid,  and  model  the  features  in  the  pyramid  as  a 
stochastic  process  from  coarse-to-fine  levels  along  the  tree.  The  variables  in  the  process  are  hidden,  and  the 
observations  are  sums  of  these  hidden  variables  plus  noise.  However,  the  MSP  model  distributions  are 
Gaussian,  i.e.,  the  joint  distribution  of  all  of  the  variables  is  a  Gaussian  distribution.  This  is  clearly  not  the 
case  in  natural  images,  such  as  mammograms.  Buccigrossi  and  Simoncelli  [3],  for  example,  have  found 
that  the  distributions  of  some  features  have  high  kurtosis,  and  that  the  distribution  of  one  feature 
conditioned  on  a  neighboring  feature  has  a  “bow-tie”  shape,  which  cannot  follow  from  a  Gaussian  joint 
distribution.  We  have  obtained  similar  results  using  our  HIP  model  (Figure  4).  The  MSP  approach  also 
models  the  probability  of  the  observations  on  the  tree,  not  the  probability  of  the  image. 
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Figure  4.  Empirical  (left)  and  modeled  (right)  conditional  histograms  of  image  feature  pairs. 

All  of  these  methods  seem  well-suited  for  modeling  texture,  but  it  is  unclear  how  we  might  build  the 
models  to  capture  the  appearance  of  more  structured  objects  or  objects  which  are  hybrid  in  nature  (e.g. 
which  include  both  structure  and  texture),  such  as  mammographic  masses.  We  can  argue  that  the  presence 
of  objects  in  images  can  make  local  conditioning  like  that  of  the  flexible  histogram  and  MSP  approaches 
inappropriate.  Objects  in  the  world  cause  correlations  and  non-local  dependencies  in  images  across 
different  resolutions.  For  example,  the  presence  of  a  particular  object  might  cause  a  certain  kind  of  texture 
to  be  visible  at  some  resolution.  Usually  the  local  image  structure  at  lower  resolutions  by  itself  will  not 
contain  enough  information  to  infer  the  object’s  presence,  but  the  entire  image  at  lower  resolutions  might. 
Therefore  the  probability  that  a  texture  is  present  will  depend  on  a  large  region  in  the  lower-resolution 
image. 

Similarly,  objects  create  long-range  spatial  dependencies  at  a  given  resolution.  For  example,  an  object  class 
might  result  in  a  kind  of  texture  across  a  large  area  of  the  image.  If  an  object  of  this  class  were  always 
present,  we  would  know  that  the  texture  is  present.  But  if  such  objects  are  not  always  present  and  cannot  be 
inferred  from  lower-resolution  information,  knowing  that  the  texture  is  present  at  one  location  tells  us  that 
it  is  present  elsewhere. 

These  considerations  imply  that  the  assumptions  of  the  flexible  histogram  and  MSP  approaches  limit  their 
capabilities.  The  features  at  one  resolution  and  one  location  depend  on  lower-resolution  image  information 
over  a  large  area  of  the  image,  and  even  given  that  information  they  depend  on  the  features  at  other 
locations  at  that  resolution. 

More  recently  others  have  developed  models  similar  to  our  HEP  model.  Crouse  et  al  [15]  developed  a  class 
of  models  they  called  Hidden  Markov  Trees  (HMTs)  that  differ  in  various  details  from  HIP  models,  but 
share  much  of  the  same  spirit.  They  tend  to  emphasize  the  high  kurtosis  of  the  marginal  distributions  of 
wavelet  coefficients,  which  they  mode  with  a  mixture  of  two  Gaussians.  Our  HEP  model  is  a  little  more 
general,  at  least  in  some  ways,  but  they  seem  to  have  been  quite  successful  in  applying  HMTs  to  several 
areas  such  as  image  denoising  and  segmentation.  Also  Cheng  and  Bouman  [16]  developed  similar  tree- 
structured  image  probability  models  for  segmentation  as  extensions  of  Bouman  and  Schapiro’s  work  on 
tree-structured  multi-resolution  segmentation  [17]. 

b.  HIP  architecture  and  variations 

Here  we  briefly  describe  HIP  models  and  discuss  variation  in  their  architectures.  A  HEP  model  is  built  on 
the  assumption  that  image  information  should  be  represented  at  various  length  scales  and  that  it  is  usually 
good  to  condition  fine  scale  information  on  coarser.  For  example,  given  an  image  of  some  object  at  some 
resolution,  if  we  can  identify  the  object  we  know  a  great  deal  about  its  likely  appearance  at  higher 
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resolution.  In  these  cases  conditioning  on  low  resolution  information  makes  the  higher  resolution 
information  less  dependent  at  different  locations.  Accordingly  we  decompose  the  image  into  some  multi¬ 
resolution  representation  such  as  a  wavelet  pyramid  decomposition,  like  the  of  Simoncelli  and  Adelson 
[18].  Any  invertible  decomposition  would  do,  since  we  are  then  just  expressing  the  distribution  in  a 
different  coordinate  system.  We  then  build  a  model  of  the  probability  distribution  of  each  pyramid  level  in 
the  decomposition,  conditioned  on  the  next  coarser  level,  if  there  is  one.  These  distributions  are  still  very 
complex  for  most  classes  of  images,  so  we  would  like  to  simplify  them  further  by  factoring  into  individual 
distributions  over  the  wavelet  coefficient  vectors  at  each  position.  By  itself  this  is  too  strong  an 
assumption,  but  it  is  necessarily  true  that  if  we  further  condition  on  appropriate  information  then  the 
distribution  does  factor,  i.e.,  the  coefficient  vector  at  a  position  is  conditionally  independent  of  the  vectors 
at  other  positions.  The  problem  is  finding  the  “appropriate  information.”  Accordingly  we  add  hidden 
variables  to  the  model,  build  the  model  so  that  the  model  distributions  of  coefficient  vectors  are 
conditionally  independent,  and  fit  this  to  the  data. 

Our  choice  for  hidden  variables  is  strongly  constrained  by  computational  constraints.  Accordingly  we 
made  two  choices.  First,  the  hidden  variable  at  a  position  in  a  level  conditions  the  wavelet  coefficient 
vector  at  that  position,  and  does  so  by  indexing  a  multivariate  Gaussian  distribution  for  that  vector.  Thus 
we  are  building  something  like  a  Gaussian  mixture  model  for  the  coefficient  vectors.  Dependence  between 
levels  is  introduced  by  giving  the  means  of  the  Gaussians  a  linear  dependence  on  the  parent  wavelet 
coefficients. 

Our  second  choice  is  to  give  a  tree  structure  to  the  hidden  variables.  That  is,  at  each  position  in  a  level  the 
hidden  variable  depends  only  on  a  hidden  variable  at  the  next  coarser  level  at  the  parent  position.  This 
position  is  determined  by  the  subsampling  operation  of  the  wavelet  decomposition.  These  parent-child 
relationships  then  determine  a  tree  or  quad-tree  like  structure.  Since  the  hidden  variables  are  indices  of 
mixture  components,  they  are  essentially  integers  in  some  range.  There  is  therefore  a  mixture  model  for  the 
coefficient  vector  at  a  position,  but  the  probabilities  of  the  components  are  determined  by  the  rest  of  the 
image.  The  same  mixture  components  are  used  at  all  locations  in  a  level,  but  we  are  free  to  use  different 
components  and  different  numbers  of  components  at  each  level. 

A  minor  but  important  elaboration  is  needed  to  model  spatial  patterns.  Originally  we  assumed  the 
distributions  at  each  child  position  of  a  parent  (upper  and  lower  left,  and  upper  and  lower  right,  for 
example)  were  the  same.  We  believe  it  is  better  to  allow  these  to  be  different,  so  that  a  parent  label  can 
indicate  particular  spatial  patterns,  such  as  an  edge  passing  through  the  upper  two  child  positions.  We  have 
kept  this  in  all  except  our  earliest  work. 

That  is  a  basic  description  of  our  earliest  HIP  models.  We  will  often  refer  to  the  hidden  indices  as  labels. 
These  labels  serve  a  dual  purpose:  first  to  determine  a  Gaussian  mixture  component,  and  second,  to 
determine  the  hidden  labels  at  finer  resolutions.  These  two  functions  are  somewhat  at  odds.  For  example, 
the  hidden  label  at  the  very  root  of  the  tree  may  need  to  have  a  large  number  of  possible  values  to  cover 
different  types  of  images  within  the  class  it  is  modeling.  However  there  is  only  one  wavelet  coefficient 
vector  per  image  at  the  root,  and  this  vector  may  have  several  dimensions,  requiring  quite  a  few  parameters 
per  mixture  component.  The  number  of  possible  values  of  the  hidden  label  is  limited  because  we  need 
sufficient  examples  per  label  value  to  fit  the  mixture  component’s  parameters.  We  have  addressed  this  by 
altering  the  model  to  have  two  hidden  labels  at  each  position.  One  hidden  label,  which  we  refer  to  as  the 
mixture  label,  serves  as  the  index  of  mixture  component  and  depends  only  on  the  other  hidden  label.  The 
second  hidden  label,  which  we  call  the  hierarchy  label,  conditions  only  the  local  mixture  label  and  the  child 
hierarchy  labels.  This  makes  it  possible  to  have  few  mixture  components  and  many  hierarchy  labels  at 
low-resolution  pyramid  levels. 

We  have  also  divided  the  mixture  labels  into  what  might  be  called  a  pattern  label,  though  we  continue  to 
refer  to  it  as  a  mixture  label,  and  a  scale  label.  This  is  intended  to  add  some  of  the  structure  that 
Wainwright  and  Simoncelli  found  useful  in  modeling  wavelet  coefficient  statistics  [19].  They  found  that 
much  of  the  statistics  between  pairs  of  wavelet  coefficients  at  different  positions,  orientations,  and/or  scales 
can  be  fit  by  a  model  with  Gaussian  distributions  for  the  coefficients  and  a  hidden  scale  at  each  position, 
conditioned  in  a  tree  structure  like  the  hidden  labels  in  HIP.  Their  scale  variables  are  continuous,  however. 
In  our  case  the  pattern  label  indicates  the  mean,  covariance  matrix  and  correlation  matrix  that  should  be 
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used,  up  to  a  scale.  The  scale  label  indicates  a  factor  by  which  the  mean,  covariance  and  parent  correlation 
will  be  scaled. 

There  are  many  possible  variations  on  these  divisions  of  the  hidden  labels  into  different  parts.  All  of  them 
basically  add  substructure  to  the  labels  and  their  dependencies.  A  version  we  have  recently  used  has  the 
mixture,  scale  and  hierarchy  labels.  The  mixture  and  scale  labels  only  condition  the  Gaussian  distribution 
of  the  local  coefficient  vector,  as  described  above.  The  hierarchy  label  conditions  the  mixture,  scale  and 
hierarchy  coefficients  at  the  child  positions,  not  at  the  same  level.  In  this  way  somewhat  different  functions 
are  separated:  the  local  image  structure  at  a  given  scale  and  position  is  influenced  by  one  pair  of  hidden 
variables,  while  a  separate  hidden  variable  is  used  to  influence  the  spatial  arrangement  of  such  structures  at 
finer  resolutions. 

c.  Architecture  Search 

Information  theory  gives  us  criteria  by  which  to  judge  models,  but  this  alone  does  not  tell  us  how  to  select 
models  to  compare,  i.e.,  what  path  through  the  space  of  possible  models  we  should  take  to  find  the  best 
model.  For  the  HIP  model  the  search  is  over  vectors  of  natural  numbers,  so  there  is  no  gradient  to  follow  to 
find  even  a  local  minimum. 

Initially  we  used  PMDL  as  the  search  criterion  along  with  a  pseudo-gradient  descent  algorithm.  In  this 
algorithm  we  computed  how  the  PMDL  cost  would  change  when  the  number  of  labels  at  one  level  was 
changed.  We  then  searched  along  the  vector  over  levels  of  these  changes  for  a  minimum.  This  search 
algorithm  is  sub-optimal.  It  can  get  stuck  in  local  minima,  and  in  fact  one  might  say  it  can  get  stuck  at 
many  points  that  are  not  even  local  minima,  since  the  points  at  which  the  algorithm  exits  are  only  better 
than  those  neighbors  that  differ  in  one  component  of  the  architecture  vector.  If  changes  in  more  than  one 
component  are  allowed,  many  of  these  final  points  are  probably  not  local  minima.  Unfortunately,  the 
number  of  neighbors  that  differ  in  more  than  one  component  is  very  large,  and  since  training  one 
architecture  already  takes  several  hours  on  a  Sun  Ultra  60  workstation,  this  more  complete  search  takes  a 
prohibitively  long  time. 

Furthermore  we  frequently  find  architectures  and  layers  for  which  increasing  and  decreasing  the  number  of 
components  both  give  a  decrease  in  the  cost,  yet  we  only  search  in  one  of  the  two  directions,  thus  probably 
missing  better  local  minima  part  of  the  time. 

One  alternative  to  these  heuristic  approaches  is  exhaustive  search,  at  least  in  a  bounded  region  of  the  search 
space.  This  is  optimal  but  very  expensive.  Unfortunately  the  unknown  behavior  of  means  there  are  no 
better  guaranteed-optimal  methods. 

It  may  be  possible  to  develop  a  split-and-merge  algorithm  like  that  of  Ueda,  et  al  [13].  Such  an  algorithm 
would  analyze  the  data  conditioned  on  the  model  parameters,  and  attempt  to  decide  which  labels  in  the 
model  could  be  merged  and  which  could  be  split.  In  this  way  a  new  architecture  is  always  initialized  at  a 
relatively  good  fit  to  the  data,  rather  than  with  random  starting  values.  This  precludes  using  PMDL  to  judge 
whether  a  particular  split-and-merge  operation  improved  the  architecture,  so  we  would  have  to  use  a 
conceptually  more  straightforward  estimate  of  code  length.  We  have  spent  a  little  time  investigating  split 
and  merge  criteria,  but  with  no  firm  conclusions  yet. 

In  much  of  our  work  we  tried  intuitive  approaches.  In  one  such  approach  we  used  a  HIP  architecture  in 
which  the  mixture  labels  only  condition  the  Gaussians,  i.e.,  the  hierarchy  labels  transmit  all  of  the 
information  between  levels.  We  began  with  one  hierarchy  label  per  level,  so  that  they  passed  no 
information.  In  effect  the  HIP  model  was  a  set  of  independent  mixture  models,  one  for  each  level.  We 
began  with  one  mixture  component  and  successively  split  and  retrained  the  components,  stopping  when  the 
AIC  or  MDL  cost  began  to  increase,  or  when  we  decided  not  to  spend  more  computer  time.  We  then  added 
hierarchy  components,  successively  splitting  these  and  retraining,  stopping  according  to  AIC  or  MDL.  In 
effect  we  build  as  good  a  model  of  the  distribution  of  coefficient  vectors  as  the  data  allows,  and  then  use 
the  hierarchy  components  to  learn  spatial  relations  between  the  mixture  components.  This  tended  to  result 
in  very  many  mixture  components  and  few  hierarchy  components.  In  effect  it  spends  most  of  the 
parameters  modeling  the  marginal  distributions  of  the  coefficient  vectors. 
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In  a  second  approach,  we  first  only  split  hierarchy  labels  (with  a  small  number  of  mixture  labels), 
effectively  spending  parameters  on  spatial  relationships  between  the  mixture  components.  A  third 
approach  is  to  alternate  between  splitting  the  mixture  and  hierarchy  labels.  All  of  these  seem  to  be 
workable,  though  the  approach  of  splitting  mixture  labels  first  seems  to  suppress  the  usefulness  of 
hierarchy  labels.  The  biggest  problem  is  lack  of  computer  time.  Though  we  have  tried  to  optimize  the  HIP 
training  programs  for  speed,  these  search  algorithms  result  in  very  large  models  that  consume  a  great  deal 
of  memory  and  training  time.  In  fact  we  have  not  yet  reached  a  minimum  of  the  AIC  cost. 

In  summary  the  problem  of  architecture  search  for  HIP  models  remains  open.  We  have  some  heuristic 
approaches  using  information  theoretic  cost  criteria  that  may  be  adequate.  Our  chief  difficulties  are  the 
computational  resources  needed  for  the  large  HIP  models. 

d.  Choice  of  image  representation 

In  our  early  investigations  of  image  synthesis  with  HIP  models  we  noticed  that  our  filter  sets  tended  to 
suppress  high  frequencies.  This  means  that  the  inverse  transformation  (reconstructing  an  image  from  the 
filtered  and  sub-sampled  images)  must  boost  these  frequencies.  In  extreme  cases  this  will  result  in  ringing. 
In  less  severe  cases  there  is  a  tendency  toward  “blockiness”.  This  appears  if  we  generate  a  set  of  white 
noise  images,  and  then  construct  the  original  image  that  would  have  given  these  white  noise  images  as 
feature  images.  That  is,  we  assume  the  white  noise  images  are  feature  images  and  reconstruct  the 
corresponding  original  image.  With  our  previous  features  this  tended  to  give  sharp  horizontal  and  vertical 
edges  that  nearly  group  into  squares. 

The  HIP  model  can  partially  eliminate  this  blockiness  because  it  captures  correlations  between  features  at 
neighboring  levels.  Ignoring  these  correlations  gives  increased  blockiness.  While  it  is  good  that  the  HIP 
model  can  learn  to  eliminate  artifacts  such  as  blockiness,  it  is  not  a  good  use  of  the  HEP  model’s  resources 
since  these  artifacts  are  introduced  because  of  the  choice  of  features.  We  would  prefer  to  have  features  that 
do  not  have  such  artifacts,  so  the  resources  of  the  HIP  model  can  be  devoted  to  learning  other  structures. 

We  studied  this  problem  by  trying  to  choose  better  sets  of  features  or  wavelets.  We  have  designed  even-tap 
wavelets  for  subsampling  by  two,  and  odd-tap  wavelets  for  subsampling  by  three.  (Curiously,  when 
subsampling  by  two  the  resulting  wavelets  are  only  approximately  orthonormal,  but  for  subsampling  by 
three  there  are  exactly  orthonormal  wavelets.)  We  concluded  that  these  minimize  the  tendency  for 
blockiness,  but  splitting  scales  into  discrete  bands,  i.e.,  pyramid  levels,  inevitably  introduces  this  tendency, 
since  a  flat  power  spectrum  at  each  level  corresponds  to  a  stepped  power  spectrum  of  the  corresponding 
image.  Probably  we  could  further  reduce  blockiness  through  the  use  of  overcomplete  representations  like 
Freeman  and  Adelson’s  steerable  pyramids  [20].  However  HIP  would  no  longer  be  a  model  of  the  image 
distribution,  at  least  not  obviously.  Instead  it  is  a  model  of  the  distribution  of  feature  pyramids.  For 
classification  problems  this  may  not  matter. 


2.  Experiments 

a.  Mass  Ciassification 

We  applied  HIP  to  the  problem  of  mass  detection  in  mammographic  CAD.  We  used  the  same  data  that  was 
used  to  evaluate  the  HPNN  (72  true  positive  and  100  false  positive  ROIs  taken  from  the  UofC  CAD 
system).  For  this  particular  model  =  0.79.  A  comparison  between  the  HIP  and  HPNN  performance  on 
the  same  data  is  shown  in  Table  1.  Though  the  HEP  model’s  performance  on  the  test  data  was  not  as  high 
as  the  HPNN,  our  efforts  at  model  selection  were  limited  by  the  long  training  time  and  high  memory  cost  of 
the  complex  HEP  models  that  perform  better.  Thus  we  hope  that  HEP  models  can  perform  as  well  given  the 
same  amount  of  data,  but  they  are  more  costly  to  train.  This  is  perhaps  inevitable  since  estimating  the 
image  distribution  is  a  harder  task  than  estimating  the  conditional  class  probability. 

Using  this  new  hidden  label  architecture,  the  best  HIP  model  pair,  as  defined  by  the  AIC  cost  (see  below), 
gives  Az  =  0.78,  and  has  the  ROC  curve  shown  in  Figure  5.  In  this  case  we  have  eliminated  30%  of  the 
false  positives  of  the  UofC  CAD  system  for  mass  detection,  without  loss  in  sensitivity. 
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Sensitivity 

Fine-to  Coarse  HPNN 

Specificity 

HIP 

Specificity 

100% 

51% 

25% 

95 

57 

36 

90 

67 

52 

80 

79 

75 

Table  1.  Specificity  vs.  sensitivity  for  the  HPNN  and  HIP  mass  detectors. 


Figure  5.  ROC  curve  of  best  HIP  model,  chosen  using  AIC.  Results  are  relative  to  UofC  CAD  system 
for  mass  detection. 


b.  Novelty  Detection 

Novelty  detection  identifies  examples  that  are  significantly  different  from  the  examples  on  which  the 
model(s)  was  trained  [23].  Detecting  novel  examples  can  be  useful  in  a  CAD  system  for  generating 
confidence  measures  on  the  CAD  output  and  identifying  data  that  could  be  used  in  future  training  of  the 
neural  network/statistical  model.  The  HIP  model’s  generative  structure  enables  novel  examples  to  be 
identified  by  thresholding  the  log-likelihood  of  the  models.  Figure  6  illustrates  how  ROC  performance 
improves  if  novelty  detection  is  used  to  generate  a  confidence  measure  for  rejecting  low-confidence 
examples.  In  this  example,  two  HIP  models  were  trained,  one  for  positive  ROIs  and  one  for  negatives 
ROIs  (same  ROI  database  as  for  classification  and  synthesis).  Test  data  was  evaluated  by  computing  the 
likelihood  ratio  of  the  models  as  well  as  the  absolute  value  of  the  log-likelihoods.  The  absolute  values  of 
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the  log-likelihoods  are  thresholded  such  that  low  values  are  considered  low  confidence  and  therefore 
rejected  (not  classified).  As  the  threshold  on  the  log-likelihood  is  increased,  more  ROIs  are  rejected 
because  of  low  confidence  and  the  area  under  the  ROC  curve  begins  to  increase.  Also  shown  in  Figure  6 
are  data  that  are  rejected  (not  classified)  because  they  fall  below  the  threshold  at  the  given  rejection  rate — 
these  ROIs  are  novel  with  respect  to  the  data  on  which  the  models  were  trained. 


Effect  of  rejecting  low-probablllty  examples  on  performance 

1 - 1 - i - 1 - 1 - 1 - 1 - 1 - r 

pos  pos 


0  0.1  0.2  0,3  0,4  0.5  0.6  0.7  0.8  0.9 

Fraction  rejected 


Figure  6:  Novelty  detection  for  improving  ROC  performance.  The  log-likelihood  of  the  two  HIP 
models  (positive  and  negative)  can  be  thresholded  so  that  we  reject  (do  not  classify)  a  fraction  of  the 
test  data  that  is  novel,  relative  to  the  training  examples.  Shown  is  the  area  under  the  ROC  curve  as 
this  novelty/conOdence  threshold  is  increased  (thus  increasing  the  fraction  rejected).  Also  shown  are 
examples  of  negative  and  positive  ROIs  that  would  be  rejected  at  different  thresholds. 

c.  Wavelet  Bases 

In  the  usual  wavelet  decomposition,  basis  filters  are  applied  to  the  image  and  the  results  are  subsampled, 
giving  four  images  or  subbands,  each  with  one-fourth  the  number  of  pixels  of  the  original  image.  This 
procedure  is  successively  applied  to  the  low-pass  subbands  to  build  the  wavelet  pyramid.  Applying  the 
decomposition  to  the  low-pass  subbands  is  a  choice;  it  reflects  the  property  of  typical  images  that  only  the 
low-frequency  components  are  correlated  across  long  distances.  In  some  images  the  higher  frequencies  can 
also  be  correlated  over  long  distances.  Ordered  textures  like  fabrics  are  good  examples  of  this.  Saito  has 
suggested  a  procedure  for  choosing  a  different  wavelet  decomposition  to  exploit  such  image  structure  [21]. 

If  every  subband  is  decomposed,  the  resulting  set  of  subbands  forms  a  tree  with  edges  from  each  subband 
to  those  subbands  that  result  from  decomposition.  The  leaves  of  the  tree  are  single-pixel  images  with  very 
specific  frequency  content.  There  are  as  many  leaves  as  pixels  in  the  original  image.  The  collection  of 
such  subbands  is  overcomplete;  at  any  node  in  the  tree  the  subband  contains  all  the  information  in  any  set 
of  its  descendents.  A  complete  set  of  subbands  is  any  set  that  contains  one  and  only  one  subband  on  any 
path  from  the  root  of  the  tree  to  a  leaf.  Saito  called  this  a  wavelet  packet  basis. 

Saito  suggested  using  an  entropy  measure  to  choose  between  these  complete  sets  of  subbands.  This 
entropy  is  obtained  by  normalizing  the  sum  of  the  squares  of  the  pixels  in  the  image  before  decomposing  it. 
Because  the  wavelet  transform  is  orthonormal,  the  sum  of  the  squares  of  the  pixels  in  any  complete  set  of 
subbands  will  also  be  equal  to  one.  Call  the  square  of  a  pixel  e  (for  energy),  the  entropy  is  defined  as 
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(6) 


H  —  Yj  ^b,x  ’ 

b,x 

where  the  sum  is  over  subbands  b  and  positions  x.  This  is  a  measure  of  compactness  or  sparseness.  It  is 
maximized  if  every  pixel  in  every  subband  has  the  same  energy,  and  minimized  (equal  to  zero)  if  only  one 
pixel  in  one  subband  has  non-zero  energy. 

Given  the  tree  structure  of  the  wavelet  decomposition,  Saito  searches  each  branch  recursively,  comparing 
the  partial  entropy  of  a  node  with  the  minimal  energy  of  all  complete  sets  of  the  node’s  children.  This  is 
tractable  because  the  partial  entropy  of  a  node  is  independent  of  the  partial  entropy  in  other  branches  of  the 
tree. 

We  have  experimented  with  this  approach  to  choosing  a  wavelet  packet  basis,  using  several  image  classes. 
It  appears  to  be  useful  for  highly  structured  images  like  some  textures,  as  suggested  above.  Mammograms, 
on  the  other  hand,  seem  to  lack  sufficient  structure  to  make  this  as  useful,  but  the  technique  has  shown  that 
we  can  usefully  decompose  all  of  the  level-one  subbands  (level  zero  being  the  original  image),  giving  a 
wavelet  packet  basis  with  relatively  high  dimensional  level  two  coefficient  vectors.  The  advantage  is  some 
memory  savings  in  the  HIP  model,  since  the  highest  resolution  level  is  then  smaller,  and  the  memory 
needed  for  HIP’s  internal  variables  during  training  is  reduced. 

of.  Mass  Synthesis 

Since  the  HEP  model  is  a  generative  model,  we  can  sample  the  model  and  synthesize  new  images.  In 
practice,  this  property  might  be  best  utilized  for  image  compression  or  noise  reduction.  Within  the  context 
of  ROI  classification,  synthesized  images  can  give  us  insight  into  what  features  the  model  is  extracting  and 
representing  for  both  positive  and  negative  ROIs,  Using  the  same  ROI  database  used  for  classification,  we 
constructed  HIP  models  for  positives  (cancer)  and  negatives  (no  cancer).  The  trained  HIP  models  were 
sampled  to  synthesize  new  ROI  images.  Figure  7  shows  examples  of  these  images.  Inspection  of  the 
synthesized  positive  ROIs  shows  more  focal  structure,  with  more  well-defined  borders  and  higher  spatial 
frequency  content  than  the  negative  ROIs. 

The  sampling  procedure  begins  at  the  coarsest  resolution,  where  the  hidden  labels  are  randomly  sampled 
from  the  distribution  P{Ai).  The  feature  (wavelet  coefficient)  images  are  then  sampled  from 
P(G I  \Ai),  The  are  used  to  construct  Ii_i ,  from  which  the  parent  feature  (wavelet  coefficient) 
images  are  constructed.  We  then  sample  Ai_i  from  P{Ai^i  |A^),  and  then  G^_i  from 
P(G  I  ¥i ,  Ai^i ) .  This  is  repeated  until  the  finest  resolution  is  reached  and  Iq  is  constructed. 


Figure  7:  Mammographic  ROI  images  synthesized  from  positive  (left)  and  negative  (right)  HIP 
models.  Synthesized  positive  ROIs  tend  to  have  more  focal  structure,  with  more  defined  borders  and 
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higher  spatial  frequency  content.  Negative  ROIs  tend  to  he  more  amorphous  with  lower  spatial 
frequency  content. 


e.  Mass  Compression 

A  stream  of  random  variables  can  be  optimally  compressed  if  we  know  their  distribution,  and  so  having  a 
HIP  model  of  a  source  of  images  should  allow  us  to  compress  examples  of  those  images  with  high 
efficiency.  Here  we  demonstrate  compression  with  HIP  models  using  a  simple  technique. 

Given  an  image  and  a  HIP  model,  we  compute  the  most  likely  value  of  each  hidden  label, 

ai  (jc)  =  argmax  P{ai  ,x,I) .  (7) 

aiM 


We  then  code  each  feature  vector  g/  (jc)  using  P(g;  |  f/+i ,  a/ ,  a:)  .  The  latter  is  used  by  decomposing 


g/  (jc)  into  its  components  along  the  eigenvectors  of  the  covariance  matrix  of  P(g/  |  f/+i ,  cz/ ,  jc)  ,  E 


and  coding  those  components  with  a  specified  precision  using  Huffman  encoders  for  the  Gaussian 
distributions  with  variances  given  by  the  eigenvalues  of  .  The  resulting  bitstream  was  stored  in  a  file 

that  was  susbequently  compressed  with  gzip  to  reduce  the  redundancy  in  the  many  short  identical  bit 
patterns.  This  procedure  is  currently  very  computationally  expensive,  and  is  not  necessarily  optimal  even  if 
the  HIP  model  exactly  matches  the  image  distribution,  but  it  is  straightforward  to  code  and  serves  to 
demonstrate  the  capability. 


Error  Vs.  File  Size  Max.  Error  Vs.  File  Size 


Figure  8.  Error  vs.  file  size  for  HIP  compression  algorithm  and  JPEG.  The  left  plot  shows  root 
mean-squared  error,  while  the  right  plot  is  maximum  error. 
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Figure  9.  Detail  of  compressed  images.  Left  is  a  piece  of  the  original  image,  center  is  the 
corresponding  piece  of  the  JPEG-compressed  image,  and  right  is  the  corresponding  piece  of  the  HIP 
compressed  image.  The  JPEG  and  HIP  compression  parameters  were  chosen  to  give  obvious 
distortion  and  nearly  equal  file  sizes. 

Figure  8  shows  the  root-mean-squared  and  maximum  errors  versus  the  size  of  the  resulting  compressed  file, 
respectively.  This  is  for  one  randomly-chosen  mass  ROI  image,  which  was  not  part  of  the  training  set  of 
the  HIP  model.  The  HIP  algorithm  gives  mean  errors  that  are  comparable  to  JPEG,  and  these  limited  results 
suggest  that  its  maximum  errors  are  a  little  lower  than  with  JPEG.  It  is  perhaps  not  surprising,  since  the 
HIP  model  was  fit  to  similar  data  while  JPEG  is  intended  to  be  general,  but  it  demonstrates  the  potential. 
Compressed  and  uncompressed  images  are  shown  in  Figure  9. 

f.  Microcalcification  Classification 

We  also  spent  some  time  applying  HIP  models  to  the  problem  of  classifying  ROIs  as  microcalcification 
clusters  or  false  positives.  This  differs  somewhat  from  our  previous  approaches,  in  which  we  attempted  to 
detect  the  individual  calcifications,  then  classify  the  ROI  based  on  the  number  of  calcifications.  The  data 
was  provided  by  Prof.  Robert  Nishikawa  at  the  University  of  Chicago.  We  trained  the  positive  model  on 
42  ROIs  chosen  at  random  from  the  positives,  and  the  negative  model  on  88  negatives  chosen  at  random 
from  the  full  set  of  negatives.  This  left  42  positives  and  87  negatives  for  testing.  Again,  while  searching 
for  the  best  HIP  architecture  we  ran  out  of  computer  memory  and  training  became  very  slow,  due  to  the 
complexity  of  the  model,  before  the  AIC  cost  reached  a  minimum.  The  resulting  ROC  curve  on  the  test 
data  had  A^=0.68. 

g.  Microcalcification  synthesis 

Applying  the  trained  models  to  synthesizing  microcalcification  cluster  ROIs  gave  results  like  those  shown 
in  Figures  10  and  11.  Figure  10  shows  images  generated  by  HIP  models  chosen  by  splitting  mixture 
components  first,  while  the  images  in  Figure  1 1  were  generated  by  models  chosen  by  splitting  the  hierarchy 
components  first.  Note  that  all  of  these  images  were  generated  with  the  same  underlying  stream  of  random 
numbers. 

The  isolated  blobs  in  Figure  10,  especially  in  the  negative  image,  seem  to  be  due  to  the  poor  ability  to 
represent  spatial  patterns.  Otherwise,  in  both  cases  the  negatives  seem  to  be  somewhat  smoother,  but  that 
was  not  always  the  case  for  other  images  generated  by  the  models. 
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Figure  10.  Synthetic  microcalcification  ROIs  generated  by  HIP  models.  The  left  image  is  from  the 
positive  model,  while  the  right  is  from  the  negative  model.  These  were  generated  by  HIP  models  with 
many  mixture  components  but  little  information  passed  between  levels  by  the  hidden  labels. 


Figure  11.  Synthetic  microcalcification  ROIs  generated  by  HIP  models.  The  left  image  is  from  the 
positive  model,  while  the  right  is  from  the  negative  model.  By  contrast  with  the  previous  figure,  these 
were  generated  by  HIP  models  with  few  mixture  components  but  many  hierarchy  labels,  i.e.,  a  great 
deal  of  information  is  passed  between  levels  by  the  hidden  labels. 

3.  HIP  Conclusions 

We  have  developed  a  class  of  image  probability  models  we  call  hierarchical  image  probability  or  HIP 
models.  We  showed  that  image  distributions  can  be  exactly  represented  as  products  over  pyramid  levels  of 
distributions  of  sub-sampled  feature  images  conditioned  on  coarser-scale  image  information.  We  argued 
that  hidden  variables  are  needed  to  capture  long-range  dependencies  while  allowing  us  to  further  factor  the 
distributions  over  position.  In  our  current  model  some  of  the  hidden  variables  act  as  indices  of  mixture 
components  while  others  condition  these  mixture  component  indices  and  carry  information  from  one  level 
to  the  next  higher  resolution  level.  The  resulting  model  is  very  similar  to  the  Hidden  Markov  Tree  models, 
but  allows  modeling  somewhat  more  general  image  structures.  Because  they  are  models  of  probability 
distributions  over  images,  these  kinds  of  models  can  be  used  for  a  wide  range  of  image  processing  tasks 
besides  classification,  e.g.,  compression,  noise-suppression,  up-sampling,  error  correction,  etc.  Here  we 
presented  results  for  mammographic  image  analysis,  including  classification,  synthesis,  and  compression. 
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However  there  are  obviously  other  modalities  and  medical  application  areas  were  HEP  models  would  be 
useful.  One  in  particular  is  multi-modal  fusion,  where  the  problem  is  to  bring  a  set  of  images,  acquired 
using  different  imaging  modalities,  into  alignment.  One  method  that  has  demonstrated  particularly  good 
performance  uses  mutual  information  as  an  objective  criterion  [22].  The  computation  of  mutual 
information  requires  an  estimate  of  entropies,  which  in  turn  requires  an  estimate  of  the  underlying  densities 
of  the  images.  The  HIP  model  potentially  provides  a  framework  for  learning  those  densities. 

For  classification  of  masses  and  microcalcifications  the  HIP  models  have  not  been  as  good  as  our  HPNN 
neural  network  classifiers.  The  cause  seems  to  be  that  the  HIP  model  is  learning  a  much  more  complex 
task,  density  estimation  on  image  space,  as  opposed  to  the  simpler  task  of  estimating  class  probability  from 
the  image.  Thus  the  HIP  models  are  much  more  complex.  The  information  theoretic  criteria  for  choosing 
between  models  of  different  complexity  seem  to  be  useful,  but  we  have  repeatedly  found  that  more 
complex  models  would  give  better  information  theoretic  cost,  when  we  could  not  train  those  more  complex 
models  due  to  limited  computer  memory  and  speed.  Thus  there  is  room  for  improvement  here  in  the  future. 
Another  path  toward  better  performance  is  to  modify  the  HIP  models.  The  tree  structure  of  the  hidden 
variables  is  far  from  optimal,  correlating  some  neighbors  while  leaving  others  much  more  independent. 

The  only  reason  for  using  trees  over  positions  and  pyramid  levels  is  computational  tractability.  Modifying 
the  tree  structure  will  require  approximate  methods,  but  may  well  be  worth  the  difficulties. 

III.  Key  Research  Accomplishments 

1.  Application  of  hierarchical  pyramid  neural  network  (HPNN)  to  mammo graphic  mass  detection. 

Results  show  a  51%  reduction  in  false  positive  rate  of  The  UofC  CAD  system  for  mass  detection 
without  loss  in  sensitivity. 

2.  Development  of  the  hierarchical  image  probability  (HIP)  model  for  mammographic  CAD.  HIP  is  a 
generative  model  that  allows  for  computing  confidence  measures  based  on  the  training  data-an 
element  that  is  often  absent  from  CAD  systems.  More  importantly,  its  structure  is  well-suited  for 
application  of  MDL  model  selection  techniques. 

3.  We  have  developed  search  strategies  and  algorithms  for  selecting  a  HIP  architecture  using  MDL  and 
AIC  information  theoretic  criteria. 

4.  HIP  models  selected  by  information  theoretic  criteria  for  mass  detection  reduced  the  false  positive  rate 
of  the  UofC  CAD  system  for  mass  detection  by  30%  without  loss  in  sensitivity.  We  also  applied  HEP 
models  to  the  detection  of  microcalcification  clusters. 

5.  We  showed  that  selecting  wavelet  packet  bases  using  an  information  theoretic  criterion  (entropy)  gives 
an  image  representation  that  allows  a  more  computationally  efficient  HEP  architecture.  In  particular 
the  model  requires  much  less  memory. 

6.  We  have  shown  that  different  information  theoretic  measures  track  the  HIP  generalization  performance 
and  thus  offer  good  criteria  for  model  selection. 

7.  We  have  demonstrated  the  utility  of  the  HIP  architecture  for  identifying  novel  ROIs.  This  novelty 
detection  is  useful  for  defining  confidence  measures  for  the  classifier. 

8.  We  have  demonstrated  the  utility  of  the  HEP  architecture  for  synthesizing  new  positive  and  negative 
mammographic  ROIs.  We  have  discussed  how  synthesis  can  be  used  to  gain  an  intuitive 
understanding  of  the  structure  that  is  captured  by  the  model, 

9.  We  have  demonstrated  the  utility  of  the  HEP  architecture  for  image  compression. 

IV.  Reportable  Outcomes 

1.  Disclosure/Patent  Application  “Hierarchical  Image  Probability  Models”,  March  1999. 
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2.  Clay  Spence,  Lucas  Parra  and  Paul  Sajda,  “Mammographic  mass  detection  with  a  hierarchical  image 
probability  (HIP)  model”,  in  Medical  Imaging  2000:  Image  Processings  Kenneth  M.  Hanson,  Editor, 
Proceedings  of  SPIE  Vol.  3979,  pp.  990-997  (2000). 

3.  Presentation  “Hierarchical  Pattern  Recognition  for  Mammographic  CAD”,  University  of  Pennsylvania, 
November  1998. 

4.  Invited  talk  at  Columbia  University  Medical  School  “Hierarchical  Neural  Networks  for  Object 
Recognition:  Applications  to  Mammographic  Computer-aided  Diagnosis”,  June  2000 

5.  DoD  Era  of  Hope  meeting  (poster),  “A  Hierarchical  Image  Probability  Model  for  Mammographic 
Mass  Detection”,  June  2000 

6.  Invited  lecture  at  The  University  Of  Pennsylvania,  Department  of  Bioengineering  “Computer  Assisted 
Diagnosis  for  Mammography”,  November  1999 

7.  NIMA/DARPA  Medical  Dual-use  project  ($1.8M).  Focus  on  developing  dual-use  technology  for 
medical  and  military  applications.  Medical  areas  include  breast  cancer,  lung  cancer,  retinal  disease  and 
neurological  disease. 

8.  Clay  Spence,  Lucas  Parra  and  Paul  Sajda,  “Hierarchical  Image  Probability  (HIP)  Models.”  In  the 
Proceedings  of  ICIP  2000,  the  IEEE  International  Conference  on  Image  Processing. 

9.  Clay  Spence,  Lucas  Parra  and  Paul  Sajda,  “Detection,  synthesis  and  compression  in  mammographic 
image  analysis  using  a  Hierarchical  Image  Probability  (HIP)  model.”  Submitted  to  MMBIA  2001,  the 
IEEE  Workshop  on  Mathematical  Methods  in  Biomedical  Image  Analysis. 

10.  Paul  Sajda  and  Clay  Spence,  “Learning  Contextual  Relationships  in  Mammograms  using  a 
Hierarchical  Pyramid  Neural  Network.”  Submitted  to  IEEE  Trans.  Medical  Imaging.  (Accepted 
subject  to  minor  revisions.) 


V.  Conclusion 


We  have  demonstrated  the  utility  of  hierarchical  pattern  recognizers  for  improving  the  performance  of 
CAD  systems  for  mass  detection.  Mass  detection  is  currently  the  more  difficult  problem  in  mammographic 
CAD  (compared  to  microcalcification  detection).  For  CAD  systems  to  gain  clinical  acceptance,  false 
positives  must  be  significantly  reduced  without  loss  in  sensitivity.  On  a  small  research  database,  application 
of  our  HPNN  model  has  resulted  in  a  5 1%  reduction  in  false  positive  rate  of  the  UofC  CAD  system  for 
mass  detection.  However  the  HPNN  models  that  we  have  trained  are  not  well-suited  to  objective  model 
selection  techniques,  such  as  MDL.  Since  objective  model  selection  is  often  critical  to  maximizing  the 
performance  of  a  pattern  recognizer,  we  developed  a  new  hierarchical  pattern  recognition  framework  that 
we  call  the  hierarchical  image  probability  (HIP)  model. 

We  have  developed  search  strategies  for  applying  information  theoretic  criteria  to  the  problem  of  selecting 
the  best  label  architecture  for  a  HIP  model.  Furthermore,  we  demonstrated  that  these  criteria  correlate  well 
with  generalization  performance  of  the  classifiers. 

Our  results  show  that  HIP  models  selected  using  these  criteria  can  reduce  false  positive  rates  by  30%  for  a 
data  set  constructed  using  The  University  of  Chicago  CAD  mass  detection  system.  It  appears  that  further 
improvements  are  likely  simply  by  using  more  complex  HIP  models. 

We  have  used  the  generative  structure  of  the  HIP  model  to  detect  novel  examples — examples  that 
significantly  differ  from  the  training  data.  Novelty  detection  can  be  used  to  generate  confidence  measures 
and  we  have  shown  how  these  confidence  measures  can  be  used  to  improve  ROC  performance.  In  practice 
such  examples  would  be  rejected  as  not  reliably  classifiable  by  the  models.  This  capability  is  not  shared 
with  classifiers  that  directly  estimate  the  probability  of  the  class  given  the  image. 
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We  have  demonstrated  other  useful  aspects  of  the  generative  feature  of  HEP  models.  We  have  sampled 
positive  and  negative  HIP  models  for  synthesizing  ROIs,  enabling  us  to  gain  an  intuition  into  the  structure 
the  HIP  model  learns  for  representing  the  two  classes.  We  also  developed  a  simple  algorithm  for 
compressing  images  using  HIP  models,  and  showed  that  it  performs  a  little  better  than  JPEG  on 
manunographic  regions  of  interest. 


A.  “So  What”  Section 

Statistical  pattern  recognition  is  a  key  element  in  any  mammographic  computer-aided  diagnosis  system. 
Hierarchical  pattern  recognizers  are  particularly  useful  since  they  are  capable  of  exploiting  contextual  and 
multi-resolution  information  for  detecting  clinically  significant  objects.  Most  statistical  pattern  recognizers 
that  have  been  previously  developed  for  mammographic  CAD  have  been  trained  to  estimate  the  probability 
of  the  class,  e.g.,  mass  or  non-mass,  given  the  image  or  some  features  extracted  from  the  image.  By 
contrast  HIP  models  are  trained  to  estimate  the  probability  distribution  of  images.  This  gives  HEP  models 
many  attractive  features.  One  could  use  HIP  for  detection/classification  in  the  usual  way  by  training  a 
distribution  for  each  object  class  and  using  Bayes’  rule  to  get  the  class  probability.  We  have  reported  our 
initial  results  for  this  application  of  HIP  in  this  report. 

Even  though  our  original  motivation  for  this  model  was  to  develop  a  framework  for  hierarchical  pattern 
recognition  which  could  exploit  techniques  in  MDL  model  selection,  there  are  other  attractive  features  of 
the  HEP  framework  which  could  have  a  major  impact  on  the  design  and  development  of  mammographic 
CAD  systems.  Since  HIP  computes  the  probability  density  at  the  input  image,  we  could  attempt  to  detect 
unusual  images  and  reject  them  rather  than  trust  the  classifier;  something  that  is  not  possible  with  models  of 
the  class  probability.  Building  confidence  measures  into  CAD  systems  is  an  open  area  of  research  and  the 
HIP  model  provides  a  mechanism  by  which  to  generate  these  measures.  In  fact  we  have  shown  results 
illustrating  how  novelty  detection  can  be  used  to  improve  the  ROC  performance  of  CAD  systems. 

The  HIP  model  has  applications  other  than  detection/classification,  and  can  be  used  in  these  applications 
without  further  training.  Since  the  HIP  model  is  a  generative  model,  one  can  use  it  to  compress  data,  given 
the  probability  distribution  of  the  objects  of  interest.  If  one  wants  lossless  compression  of  a  digital 
mammogram  one  need  only  train  a  HIP  model  for  a  set  of  mammographic  images  and  then  use  the 
probability  model  to  compress  the  data.  More  interesting  is  the  application  of  HIP  for  lossy  compression.  In 
that  case,  one  might  train  a  HIP  model  on  clinically  significant  objects,  such  as  mammographic  masses, 
since  those  are  the  parts  of  the  image  one  would  like  to  preserve-i.e.  have  minimal  distortion  and 
compression  artifacts.  The  entire  image  can  then  be  compressed  using  this  model.  Though  there  will  be  loss 
over  regions  of  the  mammogram  which  do  not  fit  the  model,  those  regions  of  clinical  significance  will  be 
preserved  since  they  will  have  a  good  fit  to  the  probability  model  and  require  very  few  bits  for 
compression. 

In  all  of  these  applications,  an  essential  role  was  played  by  the  information  theoretic  algorithms  for 
architecture  selection.  These  have  proven  themselves  as  reliable  and  computationally  efficient  guides  to 
the  generalization  capability  of  the  different  classifier  architectures. 
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ABSTRACT 

We  formulate  a  model  for  probability  distributions  on  image  spaces.  We  show  that  any  distribution  of  images  can  be 
factored  exactly  into  conditional  distributions  of  feature  vectors  at  one  resolution  (pyramid  level)  conditioned  on  the 
image  information  at  lower  resolutions.  We  would  like  to  factor  this  over  positions  in  the  pyramid  levels  to  make  it 
tractable,  but  such  factoring  may  miss  long-range  dependencies.  To  fix  this,  we  introduce  hidden  class  labels  at  each 
pixel  in  the  pyramid.  The  result  is  a  hierarchical  mixture  of  conditional  probabilities,  similar  to  a  hidden  Markov 
model  on  a  tree.  The  model  parameters  can  be  found  with  maximum  likelihood  estimation  using  the  EM  algorithm. 
We  have  obtained  encouraging  preliminary  results  on  the  problems  of  detecting  masses  in  mammograms. 

Keywords:  Mammography,  CAD,  Image  Probability 

1.  INTRODUCTION 

Many  approaches  to  object  recognition  in  images  estimate  Pr (class  |  image).  By  contrast,  a  model  of  the  proba¬ 
bility  distribution  of  images,  Pr(image),  has  many  attractive  features.  We  could  use  this  for  object  recognition 
in  the  usual  way  by  training  a  distribution  for  each  object  class  and  using  Bayes’  rule  to  get  Pr(class  |  image)  = 
Pr  (image  I  class)  Pr  (class)  /  Pr(image).  Clearly  there  are  many  other  benefits  of  having  a  model  of  the  distribution 
of  images,  since  any  kind  of  data  analysis  task  can  be  approached  using  knowledge  of  the  distribution  of  the  data. 
For  classification  we  could  attempt  to  detect  unusual  examples  and  reject  them,  rather  than  trusting  the  classifier’s 
output.  We  could  also  compress,  interpolate,  suppress  noise,  extend  resolution,  fuse  multiple  images,  etc. 

Many  image  analysis  algorithms  use  probability  concepts,  but  few  treat  the  distribution  of  images.  One  of  the  few 
examples  of  image  distribution  models  was  constructed  by  Zhu,  Wu  and  Mumford.^  They  compute  the  maximum 
entropy  distribution  given  a  set  of  statistics  for  some  features,  which  seems  to  work  well  for  textures  but  it  is  not 
clear  how  well  it  will  model  the  appearance  of  more  structured  objects. 

There  are  several  algorithms  for  modeling  the  distributions  of  features  extracted  from  the  image,  instead  of 
the  image  itself.  The  Markov  Random  Field  {MRF)  models  are  an  example  of  this  line  of  development;  see,  e.g., 
References  2,3.  However,  they  tend  to  be  very  computationally  expensive. 

In  De  Bonet  and  Viola’s  flexible  histogram  approach,^’^  features  are  extracted  at  multiple  image  scales,  and  the 
resulting  feature  vectors  are  treated  as  a  set  of  independent  samples  drawn  from  a  distribution.  The  distribution  of 
feature  vectors  is  then  modeled  using  Parzen  windows.  This  has  given  good  results,  but  the  feature  vectors  from 
neighboring  pixels  are  treated  as  independent  when  in  fact  they  share  exactly  the  same  components  from  lower- 
resolutions.  To  fix  this  one  might  build  a  model  in  which  the  features  at  one  pixel  of  one  pyramid  level  condition 
the  features  at  each  of  several  child  pixels  at  the  next  higher-resolution  pyramid  level.  The  multiscale  stochastic 
process  {MSP)  methods  do  exactly  that.  Luettgen  and  Willsky,®  for  example,  applied  a  scale-space  auto-regression 
(AR)  model  to  texture  discrimination.  They  use  a  quadtree  or  quadtree-like  organization  of  the  pixels  in  an  image 
pyramid,  and  model  the  features  in  the  pyramid  as  a  stochastic  process  from  coarse- to-fine  levels  along  the  tree.  The 
variables  in  the  process  are  hidden,  and  the  observations  are  sums  of  these  hidden  variables  plus  noise.  The  Gaussian 
distributions  are  a  limitation  of  MSP  models.  The  result  is  also  a  model  of  the  probability  of  the  observations  on 
the  tree,  not  of  the  image. 

All  of  these  methods  seem  well-suited  for  modeling  texture,  but  it  is  unclear  how  one  might  build  models  to 
capture  the  appearance  of  more  structured  objects.  We  will  argue  below  that  the  presence  of  objects  in  images  can 
make  local  conditioning  like  that  of  the  flexible  histogram  and  MSP  approaches  inappropriate.  In  the  following  we 
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Figure  1.  Pyramids  and  feature  notation. 


present  a  model  for  probability  distributions  of  images,  in  which  we  try  to  move  beyond  texture  modeling.  This 
hierarchical  image  probability  {HIP)  model  is  similar  to  a  hidden  Markov  model  on  a  tree,  and  can  be  learned  with 
the  EM  algorithm.  In  preliminary  tests  of  the  model  on  classification  tasks  the  performance  was  comparable  to  that 
of  other  algorithms. 

2.  COARSE-TO-FINE  FACTORING  OF  IMAGE  DISTRIBUTIONS 

Our  goal  will  be  to  write  the  image  distribution  in  a  form  similar  to  Pr(/)  ~  Pr(Fo  j  Fi)  Pr(Fi  |  F2) . . . ,  where  F^  is 
the  set  of  feature  images  at  pyramid  level  1.  We  expect  that  the  short-range  dependencies  can  be  captured  by  the 
model’s  distribution  of  individual  feature  vectors,  while  the  long-range  dependencies  can  be  captured  somehow  at 
low  resolution.  The  large-scale  structures  affect  finer  scales  by  the  conditioning. 

In  fact  we  can  prove  that  a  coarse-to-fine  factoring  like  this  is  correct.  Prom  an  image  I  we  build  a  Gaussian 
pyramid  (repeatedly  blur-and-subsample,  with  a  Gaussian  filter).  Call  the  Z-th  level  e.g.,  the  original  image  is  Iq 
(Figure  1).  Prom  each  Gaussian  level  //  we  extract  some  set  of  feature  images  F;.  Sub-sample  these  to  get  feature 
images  G/.  Note  that  the  images  in  G/  have  the  same  dimensions  as  Ii^i.  We  denote  by  G;  the  set  of  images 
containing  and  the  images  in  G;.  We  further  denote  the  mapping  from  Ii  to  G;  by  Qi. 

Suppose  now  that  Go  ‘  h  ^  Go  is  invertible.  Then  we  can  think  of  Go  as  a  change  of  variables.  If  we  have 
a  distribution  on  a  space,  its  expressions  in  two  different  coordinate  systems  are  related  by  multiplying  by  the 
Jacobian.  In  this  case  we  get  Pr(/o)  =  |0o|Pr(Go).  Since  Go  =  (Go,/i),  we  can  factor  Pr(Go)  to  get  Pr(/o)  = 
1^0 1  Pr(Go  I  /i)  Pr(/i).  If  Gi  is  invertible  for  alU  €  {0, . . . ,  L  -  1}  then  we  can  simply  repeat  this  change  of  variable 
and  factoring  procedure  to  get 


Pr(/)  = 


L~1 


]];iei|Pr(G,|//+i) 


^1=0 


Pr(/L) 


(1) 


This  is  a  very  general  result,  valid  for  all  Pr(/),  no  doubt  with  some  rather  mild  restrictions  to  make  the  change 
of  variables  valid.  The  restriction  that  Gi  be  invertible  is  strong,  but  many  such  feature  sets  are  known  to  exist,  e.g., 
most  wavelet  transforms  on  images. 


3,  THE  NEED  FOR  HIDDEN  VARIABLES 

For  the  sake  of  tractability  we  want  to  factor  Pr(G;  |  Ii^i)  over  positions,  something  like 

Pr(j)~n  n  ift+i(^)) 

I 

where  gi{x)  and  f/-|-i(x)  are  the  feature  vectors  at  position  x.  The  dependence  of  gi  on  expresses  the  persistence 
of  image  structures  across  scale,  e.g.,  an  edge  is  usually  detectable  as  such  in  several  neighboring  pyramid  levels.  The 
flexible  histogram  and  MSP  methods  share  this  structure. 


While  it  may  be  plausible  that  f/+i  (z)  has  a  strong  influence  on  gi{x),  a  model  distribution  with  this  factorization 
and  conditioning  cannot  capture  some  properties  of  real  images.  Objects  in  the  world  cause  correlations  and  non¬ 
local  dependencies  in  images.  For  example,  the  presence  of  a  particular  object  might  cause  a  certain  kind  of  texture 
to  be  visible  at  level  1.  Usually  local  features  by  themselves  will  not  contain  enough  information  to  infer  the 
object’s  presence,  but  the  entire  image  at  that  layer  might.  Thus  g/(a:)  is  influenced  by  more  of  ^/+i  than  the 
local  feature  vector. 

Similarly,  objects  create  long-range  dependencies.  For  example,  an  object  class  might  result  in  a  kind  of  texture 
across  a  large  area  of  the  image.  If  an  object  of  this  class  is  always  present,  the  distribution  may  factor,  but  if  such 
objects  aren’t  always  present  and  can’t  be  inferred  from  lower-resolution  information,  the  presence  of  the  texture  at 
one  location  affects  the  probability  of  its  presence  elsewhere. 

We  introduce  hidden  variables  to  represent  the  non-local  information  that  is  not  captured  by  local  features.  They 
should  also  constrain  the  variability  of  features  at  the  next  finer  scale.  Denoting  them  collectively  by  A,  we  assume 
that  conditioning  on  A  allows  the  distributions  over  feature  vectors  to  factor.  In  general,  the  distribution  over  images 
becomes 


Pr(/)oc53{n  n  P>:(&(^)|fm(^),^)Pr(^|/L+i)|Pr(/L+i). 

A  ^i=oxeii+i  ^ 


(2) 


As  written  this  is  absolutely  general,  so  we  need  to  be  more  specific.  In  particular  we  would  like  to  preserve 
the  conditioning  of  higher-resolution  information  on  coarser-resolution  information,  and  the  ability  to  factor  over 
positions. 

As  a  first  model  we  have  chosen  the  following  structure  for  our  HIP  model:* 


Pr(/)  oc  E  n  n  \ti+i,ai,x)  Pr(a/  |ai+i,a;)j 


(3) 


To  each  position  x  at  each  level  I  we  attach  a  hidden  discrete  index  or  label  ai{x).  The  resulting  label  image  Ai  for 
level  I  has  the  same  dimensions  as  the  images  in  G/. 

Since  ai{x)  codes  non-local  information  we  can  think  of  the  labels  Ai  as  a  segmentation  or  classification  at  the 
I  Ah  pyramid  level.  By  conditioning  a/(x)  on  ai^i{x),  we  mean  that  ai{x)  is  conditioned  on  a/+i  at  the  parent  pixel  of 
X.  This  parent-child  relationship  follows  from  the  sub-sampling  operation.  For  example,  if  we  sub-sample  by  two  in 
each  direction  to  get  G/  from  F^,  we  condition  the  variable  o;  at  (x^y)  in  level  I  on  a/+i  at  location  ([x/2j,  [y/2J)  in 
level  I  -h  1  (Figure  2).  This  gives  the  dependency  graph  of  the  hidden  variables  a  tree  structure.  Such  a  probabilistic 
tree  of  discrete  variables  is  sometimes  referred  to  as  a  belief  network.  By  conditioning  child  labels  on  their  parents 
information  propagates  though  the  layers  to  other  areas  of  the  image  while  accumulating  information  along  the  way. 


For  the  sake  of  simplicity  we’ve  chosen  Pr(g/  I  ft+i  ,  Of)  to  be  normal  with  mean  gi^ai  +  Afa, f/4-1  and  covariance 
Eoi,  that  is, 


Pr(g  I  f,  a)  =  VCg,  Mof  +  ga,  Aa) 


(4) 


4.  EM  ALGORITHM 

Due  to  the  tree  structure,  the  belief  network  for  the  hidden  variables  is  relatively  easy  to  train  with  an  EM  algorithm. 
The  expectation  step  (summing  over  a^’s)  can  be  performed  directly.  If  we  had  chosen  a  more  densely-connected 
structure  with  each  child  having  several  parents,  we  would  need  either  an  approximate  algorithm  or  Monte  Carlo 
techniques.  The  expectation  is  weighted  by  the  probability  of  a  label  or  a  parent-child  pair  of  labels  given  the  image. 
This  can  be  computed  in  a  fine-to-coarse-to-fine  procedure,  i.e.  working  from  leaves  to  the  root  and  then  back  out 
to  the  leaves.  The  method  is  based  on  belief  propagation,'^ 

*In  principle  there  is  also  a  factor  of  Pr(/L+i).  In  many  cases  Il+i  will  be  a  single  pixel  that  is  approximately  the  mean 
brightness  in  the  image.  We  ignore  this,  which  is  equivalent  to  assuming  that  Pr(/L+i)  is  flat  over  some  range.  In  this  case 
fL+i  is  zero  for  typical  features.  In  addition,  there  is  no  hidden  variable  ul+i*  If  we  combine  these  considerations  we  see  that 
the  I  =  L  factor  should  be  read  as  Ylx  \aLyX)  Pr(aL,  a;). 


Figure  2.  Tree  structure  of  the  conditional  dependency  between  hidden  variables  in  the  HIP  model.  With  subsam¬ 
pling  by  two,  this  is  sometimes  called  a  quadtree  structure. 


Once  we  can  compute  the  expectations,  the  normal  distribution  makes  the  M-step  tractable;  we  simply  compute 
the  updated  gan  j  ^an  1  as  combinations  of  various  expectation  values. 

In  order  to  apply  the  EM  algorithm,  we  need  to  choose  a  parameterization  for  the  model.  The  parameterization 
of  Pr(g  I  f,a)  is  given  above  in  Equation  4.  For  Pr(a/ 1  ai^i)  we  use  the  parameterization 

=  (5) 

2^ai  ^0.1+1 


in  order  to  ensure  proper  normalization. 

Below,  we  denote  the  new  parameter  values  computed  during  the  t-th  maximization  step  as  and  the  old 
values  as  6^. 


4.1.  MAXIMIZATION 

Maximizing  the  expectation  of  the  likelihood  over  the  hidden  variables  with  respect  to  the  model  parameters  gives 
the  following  update  formulae: 


'aiX+i  =  X^Pr(a/,a,+i,a;|/,6i*), 

X 

sit'  = 


=  ((&  - (g<  - - sit'sit'^ 

'  /  t,ai 

Here  the  brackets  (.)^  denotes  the  expectation  value 

_  E.Pr(ai,xlI,0*)X(x) 

^  E.Pr(ai,xlI,0‘)  ■ 


(6) 

(7) 

(8) 

(9) 


(10) 


4.2.  EXPECTATION 

In  the  E-step  we  need  to  compute  the  probabilities  of  pairs  of  labels  from  neighboring  layers  Pr(aijai^i,xi 
for  given  image  data.  But  note  that  in  all  occurrences  of  the  reestimation  equations,  i.e.  (5,6)  and  (10),  we  need 
that  quantity  only  up  to  an  overall  factor.  We  can  choose  that  factor  to  be  Ft{I\6^)  and  can  therefore  compute 
Fv{ai,ai-^i,xi^l\0^)  instead  using 

Pi{ai,ai+i,x\I,e^)PT{I\e^)=Fi{ai,a,+i,x,I\e*)=  ^  Pr(J,A|0*)  (11) 

A\az(a;),az+i(a;) 


The  computation  of  these  quantities  can  be  cast  as  recursion  formulae,  defined  in  terms  of  quantities  u  and  d,  which 
approximately  represent  upwards  and  downwards  propagating  probabilities.  The  recursion  formulae  are 


ui{ai,x) 

=  Pr(gilfi+i,ai,x)  JJ  ui-i(ai,x') 

(12) 

x'ech(x) 

(^f+i ) 

=  ^Pr(o;|o(+i)u;(ai,x) 

Cll 

(13) 

di{ai,x) 

=  ^Pr(o(|aj+i)J((aj+i,a;) 

(14) 

a/+i 

,  x) 

w/+i(ai+i,Par(a;))  d  /  w 

=  "  di+i{ai+i,Par{x)) 

ui[ai+i,x) 

(15) 

The  upward  recursion  relations  (12-13)  are  initialized  at  /  =  0  with  uo{ao,x)  =  Pr(g  |  fj ,  ao,  x)  and  end  at  i  =  L.  At 
layer  L  Equation  13  reduces  to  Since  we  do  not  model  any  further  dependencies  beyond  layer 

L,  the  pixels  at  layer  L  are  assumed  independent.  Considering  the  definition  of  u,  it  is  evident  that  the  product  of 
all  ul  (x)  coincides  with  the  total  image  probability, 


Pr(/|0*)  =  JJ  ul{x)  =  ul+1.  (16) 

xeiL 

The  downward  recursion  (14  -  15)  can  be  executed,  starting  with  equation  (15)  at  /  =  L  with  = 

dL+i(x)  =  I,"**  The  downwards  recursion  ends  at  /  =  0  with  equation  (14), 

We  can  now  compute  (11)  as 

Fr{ai ,  ai^i ,  x,  / 1  0^)  =  ui  {ai ,  x)di  (a/+i ,  x)  Pr (a/  |oz+i )  (17) 

Pr(az,x,/|0^)  =  ui{aiyx)di{ai^x)  (18) 

Obviously  computations  (12-18)  in  the  E-step  at  iteration  t  need  to  be  completed  with  fixed  parameters  9^. 

Because  of  the  dependence  of  gz  on  fz+i,  these  u’s  and  d’s  are  not,  in  general,  actual  probabilities.  In  spite  of 
this  it  can  be  shown  that  these  recursion  relations  are  correct. 

5.  EXPERIMENTS 

5.1.  CLASSIFICATION  OF  VEHICLES  IN  SAR  IMAGERY 

Though  not  a  medical  imaging  problem,  we  first  present  the  results  of  our  experiments  on  synthetic  aperture  radar 
(SAR)  imagery,  since  SAR  imagery  is  noisy  and  involves  detecting  an  extended  textured  object,  much  like  a  breast 
mass  and  many  other  medical  imaging  problems.  The  problem  was  to  discriminate  between  three  target  classes  in 
the  MSTAR  public  targets  data  set,  to  compare  with  the  results  of  the  flexible  histogram  approach  of  De  Bonet,  et 
al.^  We  trained  three  HIP  models,  one  for  each  of  the  target  vehicles  BMP-2,  BTR-70  and  T-72  (Figure  3).  As 
in  Reference  5  we  trained  each  model  on  ten  images  of  its  class,  one  image  for  each  of  ten  aspect  angles,  spaced 
approximately  36°  apart.  We  trained  one  model  for  all  ten  images  of  a  target,  whereas  De  Bonet  et  al  trained  one 
model  per  image. 

We  first  tried  discriminating  between  vehicles  of  one  class  and  other  objects  by  thresholding  logPr(/|  class),  i.e., 
no  model  of  other  objects  is  used.  In  essence  this  discriminates  simply  by  judging  whether  an  image  looks  sufficiently 
similar  to  the  training  examples.  For  the  tests,  the  other  objects  were  taken  from  the  test  data  for  the  two  other 
vehicle  classes,  plus  seven  other  vehicle  classes.  There  were  1,838  image  from  these  seven  other  classes,  391  BMP2 
test  images,  196  BTR70  test  images,  and  386  T72  test  images.  The  resulting  ROC  curves  are  shown  in  Figure  4a. 

We  then  tried  discriminating  between  pairs  of  target  classes  using  HIP  model  likelihood  ratios,  i.e.,  log  Pr(J  |  classl)  — 
logPr(J  I  class2).  Here  we  could  not  use  the  extra  seven  vehicle  classes.  The  resulting  ROC  curves  are  shown  in  Fig¬ 
ure  4b.  The  performance  is  comparable  to  that  of  the  flexible  histogram  approach. 

^The  (non-existent)  label  ol+i  can  be  thought  of  as  a  label  with  a  single  possible  value,  which  is  always  set.  The  conditional 
Pr(aL|aL+i)  turns  then  into  a  prior  Pr(aL) 


Figure  3.  SAR  images  of  three  types  of  vehicles  to  be  detected. 
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Figure  4.  ROC  curves  for  vehicle  detection  in  SAR  imagery,  (a)  ROC  curves  by  thresholding  HIP  likelihood  of 
desired  class,  (b)  ROC  curves  for  inter-class  discrimination  using  ratios  of  likelihoods  as  given  by  HIP  models. 

5.2.  MASS  DETECTION 

We  applied  HIP  to  the  problem  of  detecting  masses  in  ROIs  taken  from  mammograms,  as  detected  by  a  CAD  system 
at  the  University  of  Chicago.  We  trained  a  HIP  model  of  the  distribution  of  positive  images  on  36  randomly-chosen 
ROIs  that  contained  masses,  and  a  second  HIP  model  on  48  randomly-chosen  ROIs  without  masses.  The  likelihood 
ratio  was  then  used  as  the  test  criterion,  i.e.,  a  threshold  on  this  ratio  is  used  to  decide  which  ROIs  will  be  called 
masses.  The  true  and  false  positive  rates  as  a  function  of  the  threshold  were  measured  on  a  test  set  with  36  mass 
and  49  non-mass  ROIs. 

A  search  was  performed  over  the  number  of  hidden  labels  values  at  each  level.  The  search  criterion  was  the 
negative  log-likelihood  on  the  training  data  plus  the  minimum-description-length  penalty  term,  dlog(Ar)/2,  where  d 
is  the  number  of  model  parameters  and  N  is  the  the  number  of  training  examples.  The  maximum  number  of  labels 
in  a  level  was  bounded  (somewhat  arbitrarily)  at  17,  since  doubling  the  number  of  components  in  a  level  at  this  point 
was  observed  to  decrease  the  MDL  criterion,  but  very  little,  and  the  computation  time  would  approximately  double. 

The  best  architecture  had  17,  17,  11,  2,  and  1  hidden  label  in  levels  0-4,  respectively.  For  this  architecture,  Az 
was  0.73.  This  detector  had  a  specificity  of  33%  at  a  sensitivity  of  95%.  The  ROC  curve  is  shown  in  Figure  5. 
While  this  performance  is  not  as  good  as  we  might  hope,  being  worse  than  our  own  HPNN  classifier,®  for  instance,  it 
demonstrates  that  the  model  captures  relevant  information  for  classification.  We  hope  that  further  work,  particularly 
in  model  and  feature  selection,  will  improve  on  these  results. 

6.  CONCLUSION 

We  have  developed  a  class  of  image  probability  models  we  call  hierarchical  image  probability  or  HIP  models.  To 
justify  these,  we  showed  that  image  distributions  can  be  exactly  represented  as  products  over  pyramid  levels  of 
distributions  of  sub-sampled  feature  images  conditioned  on  coarser-scale  image  information.  We  argued  that  hidden 
variables  are  needed  to  capture  long-range  dependencies  while  allowing  us  to  further  factor  the  distributions  over 
position.  In  our  current  model  the  hidden  variables  act  as  indices  of  mixture  components.  The  resulting  model  is 


Figure  5.  ROC  curve  for  HIP  detector  of  Mass  ROIs  generated  by  U.  Chicago  CAD. 

somewhat  like  a  hidden  Markov  model  on  a  tree.  The  HIP  model  can  be  used  for  a  wide  range  of  image  processing 
tasks  besides  classification,  e.g.,  compression,  noise-suppression,  up-sampling,  error  correction,  etc. 

There  is  much  room  for  further  work  on  variations  of  the  specific  HIP  model  presented  here.  The  tree-structured 
discrete  hidden  variables  lend  themselves  well  to  exact  marginalization,  but  they  fail  to  capture  certain  image 
properties.  For  example,  contrast  level  and  orientation  could  be  given  continuous  parameterizations.  See,  for 
example,  the  work  of  Simoncelli  and  Wainwright,  who  developed  a  very  similar  model  to  capture  the  statistics 
of  contrast  level  (which  they  refer  to  as  “scale”),  though  they  did  not  formulate  their  model  as  an  image  probability.^ 
Furthermore,  as  is  well  known,  the  tree  structure  of  the  hidden  variable  dependencies  will  tend  to  artificially  suppress 
the  statistical  dependence  between  some  neighboring  pixels,  but  not  others.  Allowing  multiple  parents  would  alleviate 
this.  Unfortunately,  either  of  these  modifications  would  make  it  impractical  to  marginalize  over  the  hidden  variables, 
which  is  the  proper  probabilistic  procedure.  There  are  approximate  alternatives  to  exact  marginalization,  which 
should  allow  a  far  wider  variety  of  hidden  variable  structures. 
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ABSTRACT 

We  formulate  a  model  for  probability  distributions  on  im¬ 
age  spaces.  We  show  that  any  distribution  of  images  can 
be  factored  exactly  into  conditional  distributions  of  feature 
vectors  at  one  resolution  (pyramid  level)  conditioned  on  the 
image  information  at  lower  resolutions.  We  would  like  to 
factor  this  over  positions  in  the  pyramid  levels  to  make  it 
tractable,  but  such  factoring  may  miss  long-range  depen¬ 
dencies.  To  capture  long-range  dependencies,  we  introduce 
hidden  class  labels  at  each  pixel  in  the  pyramid.  The  result 
is  a  hierarchical  mixture  of  conditional  probabilities,  similar 
to  a  hidden  Markov  model  on  a  tree.  The  model  parameters 
can  be  found  with  maximum  likelihood  estimation  using  the 
EM  algorithm.  We  have  obtained  encouraging  preliminary 
results  on  the  problems  of  detecting  various  objects  in  SAR 
images  and  target  recognition  in  optical  aerial  images. 

1.  INTRODUCTION 

Many  approaches  to  object  recognition  in  images  estimate 
Pr(C'  1 1),  the  probability  that  an  object  of  class  C  is  present 
in  an  image  L  By  contrast,  a  model  of  the  probability  dis¬ 
tribution  of  images,  Pr(/ 1  C),  has  many  attractive  features. 
We  could  use  this  for  object  recognition  in  the  usual  way  by 
training  a  distribution  for  each  object  class  and  using  Bayes’ 
rule  to  get  Pr(C'  |  /),  or  by  using  the  likelihood  ratio  be¬ 
tween  Pr(/  I  C)  and  Pr(/  |  C).  Clearly  there  are  many  other 
uses  for  image  distributions,  since  any  kind  of  data  analysis 
task  can  be  approached  using  knowledge  of  the  distribution 
of  the  data.  For  classification  we  could  attempt  to  detect 
unusual  examples  and  reject  them,  rather  than  trusting  the 
classifier’s  output.  We  could  also  compress,  segment,  in¬ 
terpolate,  suppress  noise,  extend  resolution,  fuse  multiple 
images,  etc. 

Many  image  analysis  algorithms  use  probability  con¬ 
cepts,  but  few  treat  the  distribution  of  images,  e.g.,  maxi¬ 
mum  entropy  modeling  [1].  There  are  several  approaches 
that  do  not  model  the  probability  distribution  on  an  image 
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space,  but  motivated  our  work,  e.g.,  MRF  models  [2, 3],  the 
flexible  histogram  approach  [4, 5],  and  multiscale  stochastic 
processes  [6].  All  of  these  methods  seem  to  be  well-suited 
for  modeling  texture,  but  it  is  unclear  how  we  might  use 
them  to  capture  the  appearance  of  more  structured  objects. 

As  in  many  other  approaches,  we  model  the  distribu¬ 
tion  of  local  image  structxire  by  using  some  local  features, 
namely  the  outputs  of  some  filters,  and  capture  longer-range 
(either  in  scale  or  position)  dependencies  by  modeling  the 
influence  of  neighboring  structures  on  each  other.  However, 
we  argue  that  the  presence  of  objects  in  images  can  make 
local  conditioning  like  this  inadequate.  We  capture  these 
long-range  dependencies  by  using  hidden  variables.  The  de¬ 
pendencies  between  the  hidden  variables  in  our  model  are 
local,  like  those  in  some  MRF  models,  but  marginalizing 
over  them  introduces  long-range  dependencies.  We  expect 
that  such  hidden  variables  would  give  poor  models  of  ob¬ 
ject  structure  if  they  were  only  implemented  at  one  pyramid 
level.  Therefore  we  introduce  them  at  all  levels  in  a  pyra¬ 
mid,  and  give  them  coarse  to  fine  dependence. 


2.  THE  HIP  MODEL 


To  show  that  such  a  model  can  be  a  proper  distribution  on 
an  image  space,  we  show  that  any  distribution  on  an  image 
space  can  be  factored  into  a  coarse  to  fine  hierarchy  of  con¬ 
ditional  distributions.  From  an  image  I  we  build  a  Gaussian 
pyramid.  Call  the  /-th  level  Ii,  e.g.,  the  original  image  is 
Jo-  From  each  Gaussian  level  Ii  we  extract  some  set  of  fea¬ 
ture  images  F/  (Figure  1).  Sub-sample  these  to  get  feature 
images  G^,  so  that  the  images  in  G;  have  the  same  dimen¬ 
sions  as  //+i.  Denote  the  set  of  images  {1^+1,  GJ  by  G^ 
and  the  mapping  from  Ii  to  G/  by  Qi.  If  Gi  is  invertible  for 
all  /  G  {0, . . . ,  L  —  1}  it  is  easy  to  show  that 


Pr(J)  = 


rL-1 

niaHPr(G,|/m) 


'-/=o 


Pr(/x) 


(1) 


In  order  to  factor  Pr(G/ 1  )  over  positions,  we  intro¬ 

duce  hidden  variables.  There  is  enormous  freedom  in  this 
choice,  although  different  choices  can  be  easier  or  harder 


Fig.  1.  Pyramids  and  feature  notation. 


to  work  with.  One  simple  but  non-trivial  choice  is  to  intro¬ 
duce  an  image  Ai  of  integers  at  each  level  L  We  assume 
that  these  contain  enough  information  to  allow  us  to  factor 
Pr(G,|Jm)-  Furthermore  we  assume  that  the  local  hidden 
variable  ai  (a?)  and  the  local  lower-resolution  feature  vector 
f/+i  {x)  carry  all  of  the  information  in  that  is  relevant 
to  the  local  feature  vector  g/  (x) .  This  gives 

L 

Pr(J)  a  E  n  n  [Pr(a|fm,«/,a:) 

X  Pr(ai  |a/+i,x)j,  (2) 

where  ai^i  (x)  is  the  hidden  variable  at  the  parent  of  x  in 
the  tree  structure  given  by  the  sub-sampling  operation.  (To 
avoid  repeating  the  string  “(x)”,  we  specify  the  location  x 
as  a  conditioning  variable  in  each  Pr().) 

3.  TRAINING  WITH  EM 

This  model  can  be  fit  to  data  using  an  Expectation-Maxi¬ 
mization  (EM)  algorithm.  The  E-step  is  the  sum  over  hid¬ 
den  variables,  which  is  tractable  thanks  to  the  tree  struc¬ 
ture  of  their  dependencies.  We  choose  Pr(g/ 1  5+1,0^)  to 
be  normal  with  a  mean  that  depends  linearly  on  ft+i,  i.e., 
Pr(gj|fj+i,ai)  =N{Ma,ii+i  +  ga,,Aa,)- This  makes  the 
M-step  tractable,  and  is  rich  enough  to  reproduce  the  non- 
Gaussian  distribution  of  neighboring  features  on  each  other 
(see  [7]).  To  enforce  normalization  we  parameterize  the  la- 
bel  probabilities  as  Pr(a,  |  a,+i)  =  itauai+i  /  Ea,  • 

We  denote  by^  =  {go,,Mo,,  Aa,,7ra,,o,+i,Vaj,Vl}thevec- 
tor  of  all  parameters.  For  brevity  we  simply  reproduce  the 
relevant  formulas  without  derivations. 

To  compute  the  expectations  in  the  EM  algorithm  we 
need  the  joint  probabilities  of  the  image  and  individual  la¬ 
bels  at  a  position  and  pyramid  level.  These  are  given  as 

Pr(a( ,ai+i,x,I\e^)  =ui{ai,x)di (oi+i , x)  Pr(ai \ai+i) 

(3) 

Fi(ai,x,I\9*)  =ui{ai,x)di{ai,x),  (4) 


where  0^  is  the  parameter  vector  from  the  t-ih  EM  iteration. 
The  quantities  u  and  d  are  obtained  through  the  upward  and 
downward  recursion  relations 


ui{ai,x)  =  Pi(gi\fi+i,ai,x)  JJ  ui-i{ai,x')  (5) 

x'eCh{x) 

ui{ai+i,x)  =  ^Pr(ai|aj+i)wi(a/,a;)  (6) 

ai 

di{ai,x)  =  Y^Fi{ai\ai+i)di{ai+i,x)  (7) 

(8) 


Here  Ch(x)  is  the  set  of  pixel  locations  in  some  level  /  that 
are  children  of  pixel  x  in  level  /  H- 1  in  a  tree  relationship  of 
pixels  in  the  pyramid.  Similarly,  Par(x)  is  the  parent  pixel 
of  X. 

The  upward  recursion  relations  (5  -  6)  is  initialized  at 
I  =  0  with  uo(ao,  x)  =  Pr(g  |  fi,  ao,  x)  and  ends  aXl  —  L. 
At  layer  L  (6)  reduces  to  =  ul{x).^  Since 

we  do  not  model  any  further  dependencies  beyond  layer  L, 
the  pixels  at  layer  L  are  assumed  independent.  The  prod¬ 
uct  of  all  ul{x)  coincides  with  the  total  image  probability, 
Pr{/|^‘)  =  ux,{x)  =  ui^i.  The  downward  recur¬ 

sion  (7  -  8)  can  be  executed,  starting  with  equation  (8)  at 
I  =  LmihdL^i{aL-\-i^x)  —  dL+i{x)  =  1.^ 

For  the  update  equations,  let  us  denote  the  average  over 
position  at  level  I  weighted  by  Pr(a/,x  \I,0^)  by  (.)^ 
i.e., 


ExPr(a,,x|J,gOX(x) 


(9) 


Then  the  update  equations  for  the  Gaussian  parameters  are 


and 

=  {(g,  -  (g,  - 

(12) 

The  update  equation  for  the  label  probability  parameters  is 
^atX+i  ='^Pr(ai,ai+i,xlI,9*).  (13) 

X 

^The  (non-existent)  label  can  be  thought  of  as  a  label  with  a 
single  possible  value,  which  is  always  set.  The  conditional  Pr(ai;,  |  g-l+i) 
turns  then  into  a  prior  Pr(ajr,) 


Fig.  2.  Examples  of  positive  (left)  and  negative  (right)  ROIs 
for  the  aircraft  detection  problem.  Data  from  the  MassGIS 
at  http;  /  /ortho  .mi  t .  edu/nsdi/. 


Fig.  3.  values  from  a  jack-knife  study  of  detection  per¬ 
formance  of  HIP  and  HPNN  (hybrid  pyramid/neural  net¬ 
work)  models. 


Fig.  4.  SAR  images  of  three  vehicle  classes.  Data  from  the 
MSTAR  public  data  set. 


plus  seven  other  vehicle  classes.  There  were  1,838  image 
from  these  seven  other  classes,  391  BMP2  test  images,  196 
BTR70  test  images,  and  386  T72  test  images.  The  resulting 
ROC  curves  are  shown  in  Figure  5a. 

A  second  discrimination  criterion  that  uses  a  distribution 
is  the  likelihood  ratio,  log  Pr(/ 1  Ci )  *-  log  Pr(/ 1 C2) .  Here 
we  cannot  use  the  extra  seven  vehicle  classes.  The  result¬ 
ing  ROC  curves  are  shown  in  Figure  5b.  The  performance 
is  comparable  to  that  of  the  flexible  histogram  approach  of 
De  Bonet  et  al. 


4.  EXPERIMENTS 

We  have  applied  this  HIP  model  to  two  problems.  The  first 
was  to  detect  aircraft  in  aerial  photographs.  The  HEP  model 
performed  substantially  better  than  our  own  hybrid  pyramid 
neural  network  (HPNN)  algorithm  [8] .  (See  Figures  2  and 
3.)  (For  a  better  comparison  we  would  select  features  inde¬ 
pendently  for  the  HIP  and  HPNN  models.  The  HPNN  gave 
Az  =  0.86  with  a  different  set  of  features.) 

For  vehicle  discrimination  in  SAR,  we  performed  an  ex¬ 
periment  with  the  three  target  classes  in  the  MSTAR  pub¬ 
lic  targets  data  set,  to  compare  with  the  results  of  the  flex¬ 
ible  histogram  approach  of  De  Bonet,  et  al  [5].  We  trained 
three  HIP  models,  one  for  each  of  the  target  vehicles  BMP- 
2,  BTR-70  and  T-72  (Figure  4).  As  in  [5]  we  trained  each 
model  on  ten  images  of  its  class,  one  image  for  each  of  ten 
aspect  angles,  spaced  approximately  36°  apart.  We  trained 
one  model  for  all  ten  images  of  a  target,  whereas  De  Bonet 
et  al  trained  one  model  per  image. 

We  first  discriminated  between  vehicles  of  one  class  and 
other  objects  by  thresholding  log  Pr(/ 1  C),  i.e.,  no  model  of 
other  objects  is  used.  For  the  tests,  the  other  objects  were 
taken  from  the  test  data  for  the  two  other  vehicle  classes. 


5.  CONCLUSION 

We  have  presented  a  hierarchical  image  probability  (HIP) 
model  for  probability  distributions  of  images,  and  demon¬ 
strated  its  utility  in  a  pair  of  object  recognition  tasks.  The 
model  uses  hidden  class  labels  to  capture  long-range  de¬ 
pendencies.  A  distribution  model  has  many  potential  uses 
besides  recognition,  including  compression,  noise  suppres¬ 
sion,  novelty  detection,  segmentation,  etc. 

The  HIP  model  has  two  key  elements.  First  is  the  re¬ 
striction  that  the  features  be  invertible  to  make  the  model 
a  proper  probability  distribution  on  the  image  space.  It  ap¬ 
pears  to  be  possible  to  relax  these  restrictions  in  some  cases. 
Second  is  the  use  of  hidden  variables,  since  these  are  needed 
to  express  long-range  dependencies  in  the  model.  Our  cur¬ 
rent  hidden  variable  structure  was  chosen  for  tractability, 
since  we  can  explicitly  marginalize  the  hidden  variables  in 
this  structure.  Generalizations  like  choosing  a  connectivity 
denser  than  a  tree,  or  including  continuous  hidden  variables 
could  have  benefits,  but  we  would  need  approximations  to 
evaluate  the  probabilities.  There  is  much  room  for  further 
work  along  these  lines. 

We  are  also  working  on  sampling  from  HIP  models,  i.e., 
generating  random  images.  This  capability  provides  an  in- 


dependent  means  of  evaluating  the  model  that  is  not  avail¬ 
able  with  neural  network  models  of  Pr((7  [  /). 
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Abstract 

We  develop  a  probability  model  over  image  spaces  and 
demonstrate  its  broad  utility  in  mammographic  image  anal¬ 
ysis.  The  model  employs  a  pyramid  representation  to  factor 
images  across  scale  and  a  tree-structured  set  of  hidden  vari¬ 
ables  to  capture  long-range  spatial  dependencies.  This  fac¬ 
toring  makes  the  computation  of  the  density  functions  local 
and  tractable.  The  result  is  a  hierarchical  mixture  of  condi¬ 
tional  probabilities,  similar  to  a  hidden  Markov  model  on  a 
tree.  The  model  parameters  are  found  with  maximum  likeli¬ 
hood  estimation  using  the  EM  algorithm.  The  utility  of  the 
model  is  demonstrated  for  three  applications;  1)  detection 
of  mammographic  masses  in  computer-aided  diagnosis  2) 
qualitative  assessment  of  model  structure  through  mammo¬ 
graphic  synthesis  and  3)  lossless  compression  of  mammo¬ 
graphic  regions  of  interest. 

1.  Introduction 

In  mammographic  computer-assisted  diagnosis  (CAD)  one 
typically  estimates  Pr((7|/),  the  conditional  probability  of 
class  C  (e.g.  benign  vs.  malignant)  given  image  J  or  a 
set  of  features  extracted  from  I.  Previous  efforts  have  con¬ 
centrated  on  the  development  of  such  discriminant  models 
for  CAD  [1][2][3][4][?].  By  contrast,  a  generative  model, 
Pr(J|C),  has  many  attractive  features.  Classification  is  pos¬ 
sible  by  training  a  distribution  for  each  class  and  using 
Bayes’  rule  to  obtain  Pr(C7|/)  =  Pr(/|(7)  Pr(C)/ Pr(/). 
However  there  are  many  other  benefits  of  having  a  model  of 
the  distribution  of  images,  since  any  type  of  image  analysis 
can  be  approached  using  knowledge  of  the  distribution  of 
the  data.  For  example,  anomalous  images  can  be  detected 
and  rejected,  rather  than  trusting  the  classifier’s  output.  A 
generative  model  can  also  be  used  to  compress,  interpolate, 
suppress  noise,  increase  or  extend  resolution,  and  fuse  mul¬ 
tiple  images. 

In  the  computer  vision  and  pattern  recognition  commu¬ 
nity  there  has  been  limited  work  directed  at  developing 


probabilities  for  images.  One  of  the  few  examples  of  im¬ 
age  distribution  models  is  that  constructed  by  Zhu,  Wu  and 
Mumford[5].  In  their  approach  they  compute  the  maximum 
entropy  distribution  given  a  set  of  statistics  across  a  num¬ 
ber  of  features.  Though  this  approach  works  well  for  tex¬ 
tures,  it  is  not  clear  how  well  it  will  model  the  appearance 
of  more  structured  objects.  Several  algorithms  have  investi¬ 
gated  modeling  the  distributions  of  features  extracted  from 
the  image,  instead  of  the  image  itself.  The  Markov  Ran¬ 
dom  Field  {MRF)  models  are  one  such  example;  see,  e.g., 
References  [6,  7].  However,  these  models  tend  to  be  com¬ 
putationally  expensive. 

Recently,  De  Bonet  and  Viola’s  proposed  a  flexible  his¬ 
togram  approach[8,  9],  where  features  are  extracted  at  mul¬ 
tiple  image  scales,  with  the  resulting  feature  vectors  treated 
as  a  set  of  independent  samples  drawn  from  a  distribution. 
The  distribution  of  feature  vectors  is  then  modeled  using 
Parzen  windows.  Though  they  report  good  results,  their 
model  treats  the  feature  vectors  from  neighboring  pixels 
as  independent  samples  when  in  fact  they  share  exactly  the 
same  components  from  lower-resolutions.  One  solution  to 
this  is  to  build  a  model  in  which  the  features  at  one  pixel 
of  one  pyramid  level  condition  the  features  at  each  of  sev¬ 
eral  child  pixels  at  the  next  higher-resolution  pyramid  level. 
The  multiscale  stochastic  process  {MSP)  methods  do  ex¬ 
actly  that.  Luettgen  and  Willsky[10],  for  example,  applied  a 
scale-space  auto-regression  (AR)  model  to  texture  discrim¬ 
ination.  They  use  a  quadtree  or  quadtree-like  organization 
of  the  pixels  in  an  image  pyramid,  and  model  the  features 
in  the  pyramid  as  a  stochastic  process  from  coarse-to-fine 
levels  along  the  tree.  The  variables  in  the  process  are  hid¬ 
den,  and  the  observations  are  sums  of  these  hidden  variables 
plus  noise.  However  the  assumed  Gaussian  distributions  are 
a  limitation  of  MSP  models  as  well  as  the  fact  that  the  model 
is  of  the  probability  of  the  observations  on  the  tree,  not  of 
the  image. 

All  of  these  methods  appear  well-suited  for  modeling 
texture,  but  it  is  unclear  how  one  might  build  models  to 
capture  the  appearance  of  more  structured  objects.  For 
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example,  in  manamography,  benign  and  malignant  masses 
tend  to  be  characterized  by  a  combination  of  texture  and 
shape  features[ll]  and  may  also  include  contextual  influ¬ 
ences.  Therefore  local  conditioning,  like  that  of  the  flexible 
histogram  and  MSP  approaches,  is  inadequate. 

Recently,  several  groups  have  developed  what  are  essen¬ 
tially  extensions  of  the  MSP  models  by  adding  hidden  vari¬ 
ables.  These  can  be  seen  as  improving  the  model’s  ability  to 
capture  non-local  dependencies  in  the  image.  For  example, 
Crouse  et  al  developed  their  Hidden  Markov  Tree  {HMT) 
models  [12]  for  signals  and  images.  A  primary  motivation 
of  these  models  is  to  capture  the  tendency  for  wavelet  co¬ 
efficients  to  group  into  two  classes,  one  with  large  and  the 
other  with  small  coefficent  magnitudes.  Thus  their  hidden 
states  have  one  of  two  values  corresponding  to  large  and 
small  wavelet  coefficients.  This  is  well  suited  to  the  many 
signal  and  image  types  that  have  homogeneous  regions  with 
boundaries.  These  models  have  been  successfully  applied 
to  several  problems,  especially  image  denoising  and  tex¬ 
ture  segmentation.  Cheng  and  Bouman  [13]  applied  an¬ 
other  model  of  this  sort  for  segmentation,  in  which  the  ob¬ 
served  class  labels  play  the  role  of  hidden  variables,  and  so 
of  course  are  no  longer  hidden. 

We  have  independently  developed  a  class  of  models  for 
probability  distributions  of  images  that  we  call  hierarchi¬ 
cal  image  probability  (HIP)  models.  These  also  have  tree- 
structured  graph  of  the  dependencies  between  hidden  vari¬ 
ables  at  different  scales,  and  use  mixtures  of  multivariate 
Gaussians  to  model  the  local  distributions  of  vectors  of 
features.  In  the  following  we  present  the  basic  HEP  mod¬ 
els,  along  with  EM  algorithm  for  training  the  models.  We 
show  preliminary  results  of  the  application  of  HIP  models 
to  mammographic  image  analysis,  including  lesion  classifi¬ 
cation,  mammographic  synthesis  and  compression  of  mam¬ 
mographic  ROIs. 


2  Coarse-To-Fine  Factoring  Of  Im¬ 
age  Distributions 


Figure  1:  Pyramids  and  feature  notation. 


ages  F/.  Sub-sample  these  to  get  feature  images  G/.  Note 
that  the  images  in  G;  have  the  same  dimensions  as  Ji+i . 
We  denote  by  G;  the  set  of  images  containing  Iij^i  and  the 
images  in  G;.  We  further  denote  the  mapping  from  Ii  to  Gi 
byQi. 

Suppose  that  Go  is  invertible.  Then  we 

can  think  of  Go  as  a  change  of  variables.  If  we  have 
a  distribution  on  a  space,  its  expressions  in  two  differ¬ 
ent  coordinate  systems  are  related  by  multiplying  by  the 
Jacobian.  In  this  case  we  get  Pr(/o)  =  |0o|Pr(Go)- 
Since  Go  =  (Goj/i),  we  can  factor  Pr(Go)  to  get 
Pr(/o)  =  |5o|  Pr(Go  |  h)  Pr(/i).  If  Qi  is  invertible  for  all 
I  G  —  1}  then  we  can  simply  repeat  this  change 

of  variable  and  factoring  procedure  to  get 


rL-l 

-  ni 


Pr(/)=  niSHPr(GH/m) 


Pr(/L)  (1) 


This  is  a  very  general  result,  valid  for  all  Pr(/),  with 
some  rather  weak  restrictions  to  make  the  change  of  vari¬ 
ables  valid.  The  restriction  that  Gi  be  invertible  is  strong, 
but  many  such  feature  sets  are  known  to  exist,  e.g.,  most 
wavelet  transforms  on  images. 


3  The  Need  For  Hidden  Variables 


Our  goal  will  be  to  write  the  image  distribution  in  a  form 
similar  to  Pr(7)  ~  Pr(Fo  |  Fi)  Pr(Fi  |  F2)  - . . ,  where  F/ 
is  the  set  of  feature  images  at  pyramid  level  1.  We  expect 
that  the  short-range  dependencies  can  be  captured  by  the 
model’s  distribution  of  individual  feature  vectors,  while  the 
long-range  dependencies  can  be  captured  at  low  resolution. 
The  large-scale  structures  affect  finer  scales  by  the  condi¬ 
tioning. 

We  first  prove  that  a  coarse-to-fine  factoring  like  this  is 
correct.  From  an  image  I  we  build  a  Gaussian  pyramid  (re¬ 
peatedly  blur-and-subsample,  with  a  Gaussian  filter).  Call 
the  l-ih  level  //,  e.g.,  the  original  image  is  Iq  (Figure  1). 
From  each  Gaussian  level  //  we  extract  a  set  of  feature  im¬ 


For  the  sake  of  tractability  we  want  to  factor  Pt{Gi  |  Ii^i) 
over  positions,  for  example 

pr(/)~n  n  Hsi{x)\{i+i{x)) 

I  xeii+i 

where  gi(x)  and  f;+i(x)  are  the  feature  vectors  at  position 
x.  The  dependence  of  g;  on  expresses  the  persistence 
of  image  structures  across  scale,  e.g.,  an  edge  is  usually  de¬ 
tectable  as  such  in  several  neighboring  pyramid  levels.  The 
flexible  histogram  and  MSP  methods  share  this  structure. 

While  it  may  be  plausible  that  fi+i  (x)  has  a  strong  influ¬ 
ence  on  gi(x),  a  model  distribution  with  this  factorization 
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and  conditioning  cannot  capture  some  properties  of  real  im- 
ages.  Objects  in  the  world  cause  correlations  and  non-local 
dependencies  in  images.  For  example,  the  presence  of  a  par¬ 
ticular  object  might  cause  a  certain  kind  of  texture  to  be  visi¬ 
ble  at  level  L  Usually  local  features  by  themselves  will 
not  contain  enough  information  to  infer  the  object’s  pres¬ 
ence,  but  the  entire  image  Ii^i  at  that  layer  might.  Thus 
gi(a;)  is  influenced  by  more  of  than  the  local  feature 
vector. 

Similarly,  objects  create  long-range  dependencies.  For 
example,  an  object  class  might  result  in  a  specific  kind  of 
texture  across  a  large  area  of  the  image  (e.g.  malignant 
breast  masses  tend  to  have  inhomogenous  region  enhance¬ 
ment).  If  an  object  of  this  class  is  always  present,  the  distri¬ 
bution  may  factor,  but  if  such  objects  are  not  always  present 
and  cannot  be  inferred  from  lower-resolution  information, 
the  presence  of  the  texture  at  one  location  affects  the  prob¬ 
ability  of  its  presence  elsewhere. 

To  capture  these  long-range  dependencies  we  introduce 
hidden  variables  to  represent  the  non-local  information  that 
is  not  captured  by  local  features.  These  hidden  varibles  also 
constrain  the  variability  of  features  at  the  next  finer  scale. 
Denoting  the  hidden  variables  collectively  by  A,  we  assume 
that  conditioning  on  A  allows  the  distributions  over  feature 
vectors  to  factor.  In  general,  the  distribution  over  images 
becomes 

Pr(/)oc^|jJ  JJ  Pr(gj(a;)  |f;+i(a:),^) 

A  M=0a;e/i+i 

xPr(>l|4+i)|pr(/i+i).  (2) 

This  is  a  very  general  form  for  A  and  we  instead  would 
like  to  be  more  specific.  In  particular  we  would  like  to  pre¬ 
serve  the  conditioning  of  higher-resolution  information  on 
coarser-resolution  information,  and  the  ability  to  factor  over 
positions.  This  lead  to  the  following  structure  for  our  HIP 
model:  ^ 

L 

Pr(/)a  n  n  [p^gi  Ifi+i.o/.a:) 

Aoy...^Ai, 

xPr(oi|o/+i,a:)]  (3) 

To  each  position  x  at  each  level  I  we  attach  a  hidden  discrete 
index  or  label  ai{x).  The  resulting  label  image  Ai  for  level 
I  has  the  same  dimensions  as  the  images  in  G; . 

^In  principle  there  is  also  a  factor  of  Pr(Jj[,4-i).  In  many  cases  /l+i 
will  be  a  single  pixel  that  is  approximately  the  mean  brightness  in  the  im¬ 
age.  We  ignore  this,  which  is  equivalent  to  assuming  that  Pr{/L+i)  is  flat 
over  some  range.  In  this  case  is  zero  for  typical  features.  In  addition, 

there  is  no  hidden  variable  cl+i  •  If  we  combine  these  considerations  we 
see  that  the  I  =  L  factor  should  be  read  as  Hx  Pr(gL  |  ol  » a?)  Pr(aL ,  x). 


Ai+2 


A|+i 


A, 


Figure  2:  Quadtree  structure  of  the  conditional  dependency 
between  hidden  variables  in  the  HIP  model. 

Since  ai{x)  codes  non-local  information  we  can  think 
of  the  labels  Ai  as  a  learned  segmentation  at  the  /-th  pyra¬ 
mid  level.  By  conditioning  ai{x)  on  ai^i{x),  we  mean  that 
ai  (x)  is  conditioned  on  ai^i  at  the  parent  pixel  of  x.  This 
parent-child  relationship  follows  from  the  sub-sampling  op¬ 
eration.  For  example,  if  we  sub-sample  by  two  in  each  di¬ 
rection  to  get  G;  from  F;,  we  condition  the  variable  ai  at 
{x^y)  in  level  I  on  at  location  (Lar/2j,  [2//2J)  in  level 
/+!  (Figure  2).  This  gives  a  tree  structure  to  the  dependency 
graph  of  the  hidden  variables,  i.e.  a  belief  network.  By 
conditioning  child  labels  on  their  parents  information  prop¬ 
agates  though  the  layers  to  other  areas  of  the  image  while 
accumulating  information  along  the  way. 

For  simplicity  we  have  chosen  Pr(gz  |  ft+i  ,  Of)  to  be  nor¬ 
mal  with  a  mean  that  depends  linearly  on  fi+i. 

Pr(g|f,a)  =  jV(g,Mof +  go,Aa)  (4) 

4  EM  Algorithm 

Due  to  the  tree  structure,  the  belief  network  for  the  hidden 
variables  is  relatively  straightforward  to  train  with  an  EM 
algorithm.  The  expectation  step  (summing  over  a^’s)  can 
be  performed  directly.^  The  expectation  is  weighted  by  the 
probability  of  a  label  or  a  parent-child  pair  of  labels  given 
the  image.  This  can  be  computed  in  a  fine-to-coarse-to-fine 
procedure,  i.e.  working  from  leaves  to  the  root  and  then 
back  out  to  the  leaves.  The  method  is  based  on  belief  prop¬ 
agation  [14]. 

Once  the  expectations  are  computed,  the  normal  distribu¬ 
tion  makes  the  M-step  tractable;  one  simply  computes  the 
updated  gap  Sap  Map  and  Pr(a;  jai+i)  as  combinations 
of  various  expectation  values. 

In  order  to  apply  the  EM  algorithm,  a  parameteriza¬ 
tion  for  the  model  is  required.  The  parameterization  of 
Pr(g  I  f,  a)  is  given  above  in  Equation  4.  For  Fx{ai  \  o/+i) 

^Note  that  a  more  densely-connected  structure,  with  each  child  hav¬ 
ing  several  parents,  we  have  required  either  an  approximate  algorithm  or 
Monte  Carlo  techniques. 
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we  use  the  parameterization 


Pr(a,|ai+i)=  (5) 

2^ai 

in  order  to  ensure  proper  normalization. 

Below,  we  denote  the  new  parameter  values  computed 
during  the  t-th  maximization  step  as  0*+^  and  the  old  values 
as  eK 

4.1  Maximization 

Maximizing  the  expectation  of  the  likelihood  over  the  hid¬ 
den  variables  with  respect  to  the  model  parameters  gives  the 
following  update  formulae: 

X 

(7) 

and 

=  ((a  -  (g,  - 

\  /  t,aj 

-gat'gal''^-  (9) 


probabilities.  The  recursion  formulae  are 


ui{ai,x)  =Pr(gi  |fj+i,ai,a;) 


X  n  Ui-i{ai,x') 

(12) 

®'6Ch(!t) 

ui{ai+i,x)  =  '^PT{ai\ai+i)ui{ai,x) 

ai 

(13) 

di{ai,x)  =  y^Pr(aj|a(+i)J/(a(+i,ar) 

(14) 

Cll  +  l 

5/_  w/+i(oj+i,Par(a;)) 

(15) 

xdj+i(a;+i,Par(x)) 

The  upward  recursion  relations  (12-13)  are  initialized  at 
I  =  0  with  uo{cio,  x)  =  Pr(g  |  fi,  Oq, x)  and  end  at  /  =  L, 
At  level  L  Equation  13  reduces  to 

Since  we  do  not  model  any  further  dependencies  beyond 
layer  L,  the  pixels  at  layer  L  are  assumed  independent. 
Considering  the  definition  of  tx,  it  is  evident  that  the  product 
of  all  ul{x)  coincides  with  the  total  image  probability, 

Pr(J|0*)=  n 

x£Il 

The  downward  recursion  (14-15)  can  be  executed,  start¬ 
ing  with  equation  (15)  at  /  =  L  with  dL+i(ai,+i,x)  = 
c^L+i(x)  =  1.^  The  downwards  recursion  ends  at  /  =  0 
with  equation  (14). 

We  can  now  compute  (11)  as 


Here  the  brackets  expectation  value 


(10) 


Pr(oj,oj+i,x,/|0*)  =  ui{ai,x)di{ai+i,x) 

X  Pr(oi|oi+i)  (17) 

Pr(o;,a:,/|0*)  =  ui{ai,x)di{ai,x)  (18) 


4.2  Expectation 

In  the  E-step  we  need  to  compute  the  probabilities  of  pairs 
of  labels  from  neighboring  layers  Pr(af ,  a;+i  ,xi\I,  0^)  for 
given  image  data.  Note  that  in  all  occurrences  of  the  reesti¬ 
mation  equations,  i.e.  (5,6)  and  (10),  we  require  that  quan¬ 
tity  only  up  to  an  overall  factor.  We  can  choose  that  factor  to 
be  Pr(J|^^)  and  can  compute  Pr(a;,  x;,  J|^*)  instead 
using 

Pr(oi,  Oi+i,  a;  1 1, Pr(J  1 0*)  =  PT{ai,ai+i,x,  I  \  $*) 
=  PriI,A\e^)  (11) 

a4\ai(x),aj+i(a;) 

The  computation  of  these  quantities  can  be  cast  as  recursion 
formulae,  defined  in  terms  of  quantities  u  and  d,  which  ap¬ 
proximately  represent  upwards  and  downwards  propagating 


Computations  (12-18)  in  the  E-step  at  iteration  t  are  done 
with  fixed  parameters  0^. 

5  Experimental  Results 

In  this  section  we  report  some  of  our  preliminary  results  for 
applying  the  HIP  model  to  mammographic  image  analysis. 

5.1  Mass  Detection 

To  demonstrate  utility,  we  use  HIP  as  a  post-processor 
(i.e.  adjunct)  to  the  University  of  Chicago's  (UofC)  CAD 
system[15].  False  positive  and  true  positive  regions  of  in¬ 
terest  (ROIs)  were  output  from  the  UofC  CAD  system  and 

^The  (non-existent)  label  a^+i  can  be  thought  of  as  a  label  with  a 
single  possible  value,  which  is  always  set.  The  conditional  Pr(Qi,  |  ol-i-i) 
turns  then  into  a  prior  Pr(aL ) 
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Figure  3:  ROC  curve  for  results  of  HEP  models  used  as 
a  post-processor  for  mass  detection  in  the  University  of 
Chicago’s  mammographic  CAD  system. 

used  for  training  and  testing.  The  goal  was  to  determine  if 
the  HIP  model  could  be  used  to  reduce  false  positives  with¬ 
out  significant  loss  in  sensitivity. 

Two  HIP  models  were  trained;  one  using  36  randomly- 
chosen  ROIs  that  contained  masses,  and  a  second  trained  on 
48  randomly-chosen  ROIs  without  masses.  The  likelihood 
ratio  under  the  two  models  was  used  as  the  test  criterion, 
i.e.,  a  threshold  on  this  ratio  is  used  to  decide  which  ROIs 
will  be  detected  as  masses.  The  true  and  false  positive  rates 
as  a  function  of  the  threshold  were  measured  on  a  novel  test 
set  consisting  of  36  mass  and  49  non-mass  ROIs. 

A  search  was  performed  over  the  number  of  hidden  la¬ 
bels  values  at  each  level.  The  search  criterion  used  the  neg¬ 
ative  log-likelihood  on  the  training  data  plus  the  minimum- 
description-length  penalty  term,  dlog(Ar) /2,  where  d  is  the 
number  of  model  parameters  and  N  is  the  the  number  of 
training  examples  [16].  The  maximum  number  of  labels  in 
a  level  was  bounded  at  17. 

The  best  perfroming  model  had  an  architecture  of  17, 
17, 1 1, 2,  and  1  hidden  label  in  levels  0-4,  respectively.  The 
receiver  operatin  characteristic  (ROC)  curve  [17]  for  the  test 
images  is  shown  in  Figure  3.  For  this  architecture  the  area 
under  the  curve  {Az)  was  0.75.  For  this  architecture  and  set 
of  parameter  the  HIP  model  is  able  to  eliminate  17%  of  the 
false  positives  generated  by  the  UofC  CAD  system,  without 
loss  in  sensitivity. 

5.2  Novelty  Detection 

Novelty  detection  identifies  examples  that  are  significantly 
different  from  the  examples  on  which  the  model(s)  was 


Effect  of  rejecting  low-probability  examples  on  performance 


Figure  4:  Using  the  HIP  model  for  novelty  detection  and 
generating  confidence  measures.  Thresholding  the  absolute 
value  of  the  likelihood  (abscissa)  enables  rejection  of  a  frac¬ 
tion  of  the  data  that  is  novel,  relative  to  the  data  on  which 
the  models  were  trained.  This  acts  as  a  confidence  mea¬ 
sure,  which  can  improve  the  performance  of  the  model  (Az 
values  on  ordinate  axis). 

trained  [18].  Detecting  novel  examples  can  be  useful  in 
a  CAD  system  for  generating  confidence  measures  on  the 
CAD  output  and  identifying  data  that  could  be  used  in  fu¬ 
ture  training  of  the  model.  The  HEP  model’s  generative 
structure  enables  novel  examples  to  be  identified  by  thresh¬ 
olding  the  log-likelihood  of  the  models.  Figure  4  illus¬ 
trates  how  ROC  performance  improves  if  novelty  detection 
is  used  to  generate  a  confidence  measure  for  rejecting  low- 
confidence  examples.  In  this  example,  two  FHP  models 
were  trained,  one  for  positive  ROIs  and  one  for  negatives 
ROIs  (same  ROI  database  as  for  classification).  Test  data 
was  evaluated  by  computing  the  likelihood  ratio  of  the  mod¬ 
els  as  well  as  the  absolute  value  of  the  log-likelihoods.  The 
absolute  value  of  the  log-likelihoods  are  thresholded  such 
that  low  values  are  considered  low  confidence  and  there¬ 
fore  rejected  (not  classified).  As  the  threshold  on  the  log- 
likelihood  is  increased,  more  ROIs  are  rejected  because  of 
low  confidence  and  the  area  under  the  ROC  curve  increases. 


5.3  Mammographic  Synthesis 

Since  the  HEP  model  is  a  generative  model,  we  can  sample 
the  model  and  synthesize  new  images.  In  the  context  of  ROI 
classification,  synthesized  images  can  provide  qualitative 
insight  into  what  features  the  model  is  extracting  and  repre¬ 
senting  for  both  positive  and  negative  ROIs.  Using  the  same 


5 


Figure  5:  Mammographic  ROI  images  synthesized  from  positive  and  negative  HEP  models.  Synthesized  positive  ROIs  (left) 
tend  to  have  more  focal  structure,  with  more  defined  borders  and  higher  spatial  frequency  content.  Negative  ROIs  (right)  tend 
to  be  more  amorphous  with  lower  spatial  frequency  content. 
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Figure  6:  (Left)Root  mean-squared  error  vs.  size  of  compressed  file,  JPEG  and  HIP.  (Right)  Maximum  error  (L^  norm)  vs. 
size  of  compressed  file,  JPEG  and  HIP. 


ROI  database  used  for  classification,  we  constructed  HEP 
models  for  positives  (masses)  and  negatives  (no  masses). 
The  trained  HIP  models  were  sampled  to  synthesize  new 
ROI  images.  The  sampling  procedure  begins  at  the  coarsest 
resolution,  where  the  hidden  labels  are  randomly  sampled 
from  the  distribution  Ft{Al)-  The  feature  images  G/,  are 
then  sampled  from  Pr(Gz,  |  Ai).  The  G^  are  used  to  con¬ 
struct  Jl-1,  from  which  the  F/,  are  constructed.  We  then 
sample  A^-i  from  Ft{Al^i  \Ai),  and  then  Gl-i  from 
Pr(GL_i  I  Fl,Al~i)‘  This  is  repeated  until  the  finest  res¬ 
olution  is  reached  and  /q  is  constructed. 

Figure  5  shows  examples  of  these  images.  Inspection 
of  the  synthesized  positive  ROIs  shows  more  focal  struc¬ 
ture,  with  more  well-defined  borders  and  higher  spatial  fre¬ 
quency  content  than  the  negative  ROIs. 


5.4  Mammographic  Image  Compression 

A  stream  of  random  variables  can  be  optimally  compressed 
if  we  know  their  distribution,  and  so  having  a  HEP  model  of 
a  source  of  images  should  allow  us  to  compress  examples 
of  those  images  with  high  efficiency.  Here  we  demonstrate 
compression  with  HEP  models  using  a  simple  technique. 

Given  an  image  and  a  HIP  model,  we  compute 
the  most  likely  value  of  each  hidden  label,  o*(x)  = 
arg maXjjj (jj.)  Pr(o/,a;,/ 1 0^)  using  Equation  18,  and  code 
each  feature  vector  gi  (x)  using  Pr  (g;  |  ,  a* ,  a:) .  The  lat¬ 

ter  is  used  by  decomposing  g/  (x)  into  its  components  along 
the  eigenvectors  of  the  covariance  of  Pr(g/ 

Sa* ,  and  coding  those  components  with  a  specified  preci¬ 
sion  using  Huffman  encoders  for  the  Gaussian  distributions 
with  variances  given  by  the  eigenvalues  of  Ea* .  The  re¬ 
sulting  bitstream  was  stored  in  a  file  that  was  susbequently 


Figure  7:  Compression  artifacts  of  JPEG  and  HEP.  Left:  Original  image,  center:  JPEG,  right:  HIP. 


compressed  with  gzip  to  reduce  the  redundancy  in  the  many 
short  identical  bit  patterns.  This  procedure  is  currently  very 
computationally  expensive,  and  is  not  necessarily  optimal 
even  if  the  HIP  model  exactly  matches  the  image  distribu¬ 
tion,  but  it  is  straightforward  to  code  and  serves  to  demon¬ 
strate  the  capability. 

Figure  6  shows  the  root-mean-squared  and  maximum  er¬ 
rors  versus  the  size  of  the  reulting  compressed  file,  respec¬ 
tively,  This  is  for  one  randomly-chosen  mass  ROI  image, 
which  was  not  part  of  the  training  set  of  the  HIP  model. 
The  HIP  algorithm  gives  mean  errors  that  are  comparable 
to  JPEG,  and  suggests  that  its  maximum  errors  are  a  little 
lower.  It  is  perhaps  not  surprising,  since  the  HIP  model  was 
fit  to  similar  data  while  JPEG  is  intended  to  be  general,  but  it 
demonstrates  the  potential.  Compressed  and  uncompressed 
images  are  shown  in  Figure  7. 

6  Conclusion 

We  have  developed  a  class  of  image  probability  models  we 
call  hierarchical  image  probability  or  HIP  models.  To  jus¬ 
tify  these,  we  showed  that  image  distributions  can  be  ex¬ 
actly  represented  as  products  over  pyramid  levels  of  dis¬ 
tributions  of  sub-sampled  feature  images  conditioned  on 
coarser- scale  image  information.  We  argued  that  hidden 
variables  are  needed  to  capture  long-range  dependencies 
while  allowing  us  to  further  factor  the  distributions  over 
position.  In  our  current  model  the  hidden  variables  act  as 
indices  of  mixture  components.  The  resulting  model  is  very 
similar  to  the  Hidden  Markov  Tree  models,  but  allows  mod¬ 
elling  somewhat  more  general  image  structures.  Because 
they  are  models  of  probability  distributions  over  images, 
they  can  be  used  for  a  wide  range  of  image  processing 
tasks  e.g.  classification,  compression,  noise-suppression, 
up-sampling,  error  correction,  etc.  Here  we  have  presented 
results  for  mammographic  image  analysis.  However  there 
are  obviously  other  modalities  and  medical  application  ar¬ 
eas  were  HIP  models  would  be  useful.  One  in  particular 
is  multi-modal  fusion,  where  the  problem  is  to  bring  a  set 
of  images,  acquired  using  different  imaging  modalities,  into 
alignment.  One  method  that  has  demonstrated  particularly 
good  performance  uses  mutual  information  as  an  objective 


criterion  [19].  The  computation  of  mutual  information  re¬ 
quires  an  estimate  of  entropies,  which  in  turn  requires  an 
estimate  of  the  underlying  densities  of  the  images.  The  HEP 
model  potentially  provides  a  framework  for  learning  those 
densities. 
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Abstract 

This  paper  describes  a  pattern  recognition  architecture,  which  we  term  hierarchical 
pyramid/neural-network  (HPNN),  that  learns  to  exploit  image  structure  at  multiple- 
resolutions  for  detecting  clinically  significant  features  in  digital/digitized  mammograms. 
The  HPNN  architecture  consists  of  a  hierarchy  of  neural  networks,  each  network 
receiving  feature  inputs  at  a  given  scale  as  well  as  features  constructed  by  networks  lower 
in  the  hierarchy.  Networks  are  trained  using  a  novel  error  function  for  the  supervised 
learning  of  image  search/detection  tasks  when  the  position  of  the  objects  to  be  found  is 
uncertain  or  ill-defined.  We  have  evaluated  the  HPNN's  ability  to  eliminate  false  positive 
regions  of  interest  generated  by  the  University  of  Chicago's  (UofC)  Computer-aided 
Diagnosis  (CAD)  systems  for  microcalcification  and  mass  detection.  Results  show  that 
the  HPNN  architecture,  trained  using  the  UOP  error  function,  reduces  the  false  positive 
rate  of  a  mammographic  CAD  system  by  approximately  50%  without  significant  loss  in 
sensitivity.  Investigation  into  the  types  of  false  positives  that  the  HPNN  eliminates 
suggests  that  the  pattern  recognizer  is  automatically  learning  and  exploiting  contextual 
information.  Clinical  utility  is  demonstrated  through  the  evaluation  of  an  integrated 
system  in  a  clinical  reader  study.  We  conclude  that  the  HPNN  architecture  learns 
contextual  relationships  between  features  at  multiple  scales  and  integrates  these  features 
for  detecting  microcalcifications  and  breast  masses. 

Keywords:  mammography,  computer-aided  diagnosis,  hierarchical  pyramid  neural 


network,  context 
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I.  Introduction 

Computer-aided  diagnosis  (CAD)  can  be  defined  as  a  diagnosis  made  by  a  radiologist 
who  incorporates  the  results  of  computer  analyses  of  the  radiographs  [1].  The  goal  of 
CAD  is  to  improve  radiologists’  performance  by  indicating  the  sites  of  potential 
abnormalities,  to  reduce  the  number  of  missed  lesions,  and/or  by  providing  quantitative 
analysis  of  specific  regions  in  an  image  to  improve  diagnosis.  CAD  systems  typically 
operate  as  automated  “second-opinion”  or  “double-reading”  systems  that  indicate  lesion 
location  and/or  type.  Since  individual  human  observers  overlook  different  findings,  it  has 
been  shown  that  “double  reading”  (the  review  of  a  study  by  more  than  one  observer) 
increases  the  detection  rate  of  breast  cancers  by  5-15%  [2][3][4].  Double  reading,  if  not 
done  efficiently,  can  significantly  increase  the  cost  of  screening.  Methods  to  provide 
improved  detection  with  little  increase  in  costs  will  have  significant  impact  on  the 
benefits  of  screening.  Automated  CAD  systems  are  a  promising  approach  for  low-cost 
double-reading. 

Several  CAD  systems  have  been  in  development  and  the  first  has  been  recently  approved 
by  the  FDA  [5].  Complete  systems  have  been  rigorously  characterized,  both  in 
retrospective  and  prospective  trials  [6].  Though  many  have  demonstrated  clinical  utility, 
there  is  still  a  need  to  reduce  false  positive  rates  generated  by  CAD  systems.  For 
example,  prospective  clinical  studies  have  shown  lower  sensitivities  and  specificities  than 
originally  found  in  retrospective  studies  —  80%  cancers  detected  with  2.4  false  positives 
per  case  in  prospective  studies  versus  85-90%  sensitivity  at  1-2  false  positives  per  image 
in  retrospective  studies  [7]. 


3 


A.  The  Role  of  Neural  Networks  in  CAD 

CAD  systems  usually  consist  of  two  distinct  subsystems,  one  designed  to  detect 
microcalcifications  and  one  to  directly  detect  masses  [8].  A  common  element  in  both 
subsystems  is  a  neural  network,  used  to  improve  detection  and  reduce  false  positive  rates. 
Figure  1  shows  a  typical  CAD  system  processing  flowchart,  generalized  for  either 
microcalcification  or  mass  detection.  The  first  two  stages  of  the  CAD  system  increase  the 
overall  signal-to-noise  levels  in  the  image  and  apply  rules/heuristics  to  define  a  set  of 
candidate  regions-of-interest  (ROIs).  These  stages  have  adjustable  parameters  that 
typically  are  set  to  produce  a  very  high  sensitivity,  usually  at  a  cost  of  low  specificity. 
The  final  stage  is  a  statistical  model  or  neural  network,  whose  parameters  are  found  using 
error-based  optimization  given  a  set  of  training  data.  The  function  of  this  last  stage  is  to 
reduce  false  positives  (i.e.,  increase  specificity)  without  significant  loss  in  sensitivity. 
Neural  networks  are  a  particularly  important  class  of  statistical  models  in  CAD  because 
they  are  able  to  capture  complicated,  often  nonlinear,  relationships  in  high  dimensional 
feature  spaces  not  easily  captured  by  heuristic  or  rule  based  algorithms.  Several  groups 
have  developed  neural  networks  architectures  for  CAD.  Some  of  these  architectures 
exploit  well-known  features  that  might  also  be  used  by  radiologists  [9][10][1 1],  while 
others  utilize  more  generic  feature  sets  [12][13][14][25].  Both  approaches  have  been 
shown  to  be  useful  for  detecting  clinically  significant  mammographic  anomalies. 


Kinsert  figure  1  here> 
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B.  Exploiting  Context  in  Mammographic  Image  Analysis 

Context  can  be  defined  as  nearby  or  surrounding  structure  that  establishes  the  meaning  or 
identity  of  an  object.  In  image  analysis,  contextual  information  is  often  used  to  detect 
and  classify  visual  objects.  For  example,  detecting  a  small  building  in  an  aerial  image 
can  be  facilitated  by  searching  along  roads,  since  buildings  tend  to  lie  in  close  proximity 
to  roads.  Both  human  observers  and  computer  vision  systems  (e.g.  [15] )  have  been 
developed  to  exploit  contextual  relationships  in  imagery.  Likewise,  in  mammographic 
image  analysis  context  is  exploited  by  radiologists  and  mammographers  for  detecting  and 
identifying  breast  abnormalities.  The  clustering  of  calcifications,  their  proximity  to 
ductal  tissue,  the  architectural  distortion  surrounding  potential  lesions,  are  all  contextual 
cues  used  by  radiologists  and  mammographers  [16].  Contextual  relationships  can  be 
integrated  into  mammographic  CAD  systems,  being  made  explicit,  given  known 
pathology,  through  incorporation  of  preset  rules  and/or  feature  detectors  tuned  to  capture 
the  context.  Alternatively,  contextual  relationships  can  be  learned  from  the  data, 
allowing  for  more  complicated  and  less  obvious  contextual  cues  to  be  uncovered  by  the 
pattern  recognition  system. 

C.  Overview  of  Hierarchical  Pyramid/Neural  Network  Architecture 

We  have  developed  a  pattern  recognition  architecture  that  learns  contextual  relationships 
between  structure  in  images  for  detection  and  classification  of  objects.  Fundamental  to 
the  architecture  is  the  multi-scale  decomposition  of  an  image,  via  pyramid  transforms 
[20],  and  the  subsequent  integration  of  multi-scale  image  features  by  a  hierarchy  of 
neural  networks.  These  fundamental  aspects  of  the  architecture  led  to  the  name 
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hierarchical  pyramid  neural  networks  (HPNN).  Several  variants  of  the  HPNN  can  be 
defined,  dependent  upon  the  direction  of  processing  in  the  hierarchy.  Figure  2  illustrates 
the  general  coarse-to-fine  and  fine-to-coarse  architectures.  These  two  architectures  detect 
small  or  large  target  object  by  exploiting  coarse-scale  (low  resolution)  or  fine-scale  (high- 
resolution)  information  associated  with  the  target.  For  example,  in  the  coarse-to-fine 
HPNN  networks  operating  at  low  resolution  learn  contextual  features  that  are  passed  to 
networks  operating  at  high  resolution  and  integrated  to  detect  the  object  of  interest  (i.e. 
the  contextual  inputs  condition  the  probability  of  target  present).  For  the  fine-to-coarse 
HPNN  architecture  networks  extract  detail  structure  at  fine  resolutions  of  the  image  and 
then  pass  this  detail  information  to  networks  operating  at  coarser  scales  (see  figure  2B). 
For  many  types  of  objects,  information  about  the  fine  detail  structure  is  important  for 
discrimination  between  different  classes,  i.e.,  fine  resolution  structure  occurring  within 
the  context  of  the  coarse  resolution  structure  is  indicative  of  an  object  class. 

Kinsert  figure  2  here> 

We  have  previously  reported  on  how  the  HPNN  architectures  and  learning  algorithms  can 
improve  detection  for  a  general  class  of  image  search/detection  problems  [17][18][19]. 
For  example,  we  have  shown  that  for  the  problem  of  detecting  small  buildings  in  aerial 
imagery,  the  coarse-to-fine  HPNN  architecture  has  higher  accuracy  than  both 
conventional  neural  network  architectures  and  standard  statistical  classification 
techniques  [17].  In  this  paper  we  present  our  results  of  applying  the  HPNN  framework  to 
two  problems  in  manunographic  CAD;  detecting  microcalcifications  and  masses  in 


6 


digital/digitized  mammograms.  The  coarse-to-fine  HPNN  architecture  is  well-suited  for 
the  microcalcification  problem,  while  the  fine-to-coarse  HPNN  is  suited  for  mass 
detection.  We  evaluate  the  performance  and  utility  of  the  HPNN  framework  by 
considering  its  effects  on  reducing  false  positive  rates  in  a  well-characterized  CAD 
system  developed  by  The  University  of  Chicago  (UofC).  In  both  cases 
(microcalcification  and  mass  detection)  the  HPNN  acts  as  a  post-processor  of  the  UofC 
CAD  system. 

II.  Methods 

In  this  section  we  describe  three  critical  elements  of  the  HPNN;  1)  integrated  feature 
pyramid  representation,  2)  neural  network  hierarchy,  and  3)  the  learning  algorithm. 

A.  Integrated  Feature  Pyramids 

Image  features  are  extracted  and  represented  as  integrated  features  pyramids  (IFPs)  [20]. 
Multi-scale  pyramid  transforms  are  used  to  construct  the  IFP,  which  is  the  representation 
that  serves  as  input  into  the  neural  network  hierarchy.  The  pyramid  transformation  for 
the  current  set  of  experiments  is  based  on  a  general  class  of  filters  that  measure 
orientation  energy  and  image  intensity  gradients. 

For  the  coarse-to-fine  IFP,  steerable  filters  [21]  are  used  to  compute  local  oriented 
gradient  information  across  scale.  The  steering  properties  of  these  filters  enable  the  direct 
computation  of  the  orientation  having  maximum  energy.  Features  are  constructed  which 
represent,  at  each  pixel  location,  the  maximum  energy  (energy  at  orientation  dmax),  the 
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energy  at  the  orientation  perpendicular  to  6„ax  {Omax  -90°),  and  the  energy  at  the  diagonal 
(energy  at  9max  ^5).  Figure  3a  illustrates  the  form  of  the  IFF  input  into  the  coarse-to-fine 
network  hierarchy. 


Kinsert  figure  3  here> 

The  IFF  for  mass  detection  is  slightly  different  from  the  coarse-to-fine  IFF  for 
microcalcification  detection  (figure  3b).  For  mass  detection,  input  to  the  fine-to-coarse 
neural  network  hierarchy  is  an  IFF  having  radial  and  tangential  gradient  components  at 
each  resolution,  relative  to  the  mass  center.  The  features  are  filtered  versions  of  the 
image,  with  filter  kernels  given  by 


f 


¥,,p(r,e)  = 


^2 


^(q  +  \p\y- 


ip<p 


(1) 


in  polar  coordinates,  with  {q,  p)  G  { (0,1),  (1,0),  (0,2) } .  These  are  combinations  of 
derivatives  of  Gaussians,  and  can  be  written  as  combinations  of  separable  filter  kernels 
and  can  therefore  be  computed  at  relatively  low  cost.  They  are  also  straightforward  to 
steer,  being  just  a  multiplication  by  a  complex  phase  factor.  These  filters  are  steered  in 
the  radial  and  tangential  directions  relative  to  the  mass  centers,  using  the  real  and 
imaginary  components  and  their  squares  and  products,  as  features.'  Features  were 


‘  The  center  coordinates  of  the  masses  are  generated  by  earlier  stages  of  the  CAD  system. 
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extracted  at  each  level  of  the  Gaussian  pyramid  representation  of  the  mass  ROI,  and  used 
as  inputs  to  networks  at  the  same  level. 

B.  Neural  Network  Hierarchy 

The  neural  networks  in  the  HPNN  are  multi-layer  perceptrons,  having  one  hidden 
layer  with  between  4—8  hidden  units.  The  number  of  hidden  units  is  chosen  via  cross- 
validation  [22].  All  units  in  a  network  perform  a  weighted  (wj)  sum  of  their  inputs  (Xj), 
subtracting  an  offset  or  threshold  (0)  from  that  sum  to  get  the  activation  (a) 

a  =  '^WiXi-e  (2) 


The  activation  is  transformed  into  a  unit's  output,  y,  by  passing  it  through  a  sigmoid 
function 

y  =  a(a)  =  —^  (3) 

1  +  6-“ 

Each  network  in  the  HPNN  hierarchy  receives  input  from  the  integrated  feature 
pyramid  and  hidden  unit  input  from  networks  lower  in  the  hierarchy.  Networks  are 
trained  either  coarse-to-fine  or  fine-to-coarse,  depending  on  the  architecture.  In  the 
coarse-to-fine  HPNN,  the  network  lowest  in  the  hierarchy  is  first  trained  until 
convergence  and  then  all  parameters  in  this  network  are  held  fixed  while  the  next 
network  on  the  hierarchy  is  trained.  Coarse-to-fine  training  is  possible  because  the 
positions  of  the  small  objects  are  well-defined  when  the  resolution  is  decreased.  For  the 
fine-to-coarse  HPNN,  extended  objects  do  not  have  a  definite  location  at  high  resolution. 
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The  entire  hierarchy  of  networks  is  therefore  trained  as  a  single  N-layered  network  {N 
being  a  function  of  the  number  of  layers  per  network  and  the  number  of  networks  in  the 
hierarchy).^  Input  for  both  training  and  testing  is  raster  scanned  into  each  network  so  that 
the  output  of  a  network  at  any  level  is  an  image.  For  both  HPNN  architectures  the  output 
of  the  network  is  an  image  representing  the  probability  that  an  object  is  present  at  each 
x,y  position.  For  the  coarse-to-fine  architecture  each  output  pixel  represents  the 
probability  of  a  point-like  object  (e.g.  a  microcalcification),  while  for  the  fine-to-coarse 
architecture  each  output  pixel  represents  the  probability  that  a  large  extended  object  (e.g. 
a  mass)  is  within  that  low-resolution  pixel. 

C.  Learning  Algorithm 

The  conventional  error  function  for  training  a  neural  network  on  a  binary  detection 
problem  is  the  cross-entropy  error  function,  which  is  the  negative  logarithm  of  the 
probability  that  the  network  produces  detection  decisions  that  agree  with  the  targets  in  the 
training  data.  It  is  given  by 


E  =  -l [di  log  y,-  -  (1  - di ) logd -yt)]  (4) 

i 


where  d.  e  {0,1}  is  the  desired  output  and  y,  is  the  actual  output  of  the  neuron,  given  by 
equation  3.  For  image-based  detection,  since  networks  are  typically  applied  across  a  set 
of  pixels,  both  y,  and  di  are  a  function  of  position;  yi(x,y),di(x,y).  Thus  every  position 


^  Error  back-propagation  through  the  pyramid  reduction  operations  is  straightforward,  since  this  operation 
is  linear. 


10 


in  an  image  is  either  associated  with  the  presence,  (x,  )^)  =  1 ,  or  absence,  (x,  y)  =  0 , 

of  a  target. 

In  examining  the  truth  data  for  the  mammographic  ROI  datasets,  we  found  that 
radiologists  often  make  small  errors  in  localizing  individual  microcalcifications  and 
masses.  For  microcalcifications,  these  errors  appear  to  be  within  +/-2  pixels  of  the 
correct  position.  For  masses,  the  positional  error  also  includes  the  extent  of  the  mass — 
masses  have  ill-defined  borders  that  are  not  easily  ground-truthed,  even  by  an  expert.  If 
the  exact  positions  of  the  objects  are  unknown  then  the  probability  of  detecting  the 
objects  at  the  correct  positions  cannot  be  evaluated  and  using  equation  4  will  result  in 
poor  performance,  as  will  be  illustrated  below. 

Consider  instead  the  probability  of  detecting  an  object  of  interest  when  detection  is 
defined  as  at  least  one  pixel  detected  within  a  certain  region  known  to  contain  the  object. 
For  a  dataset  with  a  coordinate  vector  for  each  object,  let  x,  represent  the  coordinates  of 
the  i*  object.^  Define  a  region  P,  as  set  of  pixel  locations  for  the  i^*^  object  that  incorporate 
the  known  magnitude  of  the  uncertainty  or  positional  error  in  the  truth  data.  A  single 
detection  within  Pi  will  represent  the  detection  of  the  i*  object.  Denote  the  output  of  the 
network  when  applied  to  the  input  vector  derived  from  the  neighborhood  of  x,.  to  be 

y(x. ) .  The  probability  of  the  network  producing  at  least  one  detection  in  P,-  is  one  minus 


^  Note  that  for  analysis  of  2D  imagery,  such  as  mammograms,  X,.  =  {x,  y}.  However  the 
formulation  can  be  extended  across  an  arbitrary  coordinate  space,  so  we  use  x,  for  generality. 
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the  probability  of  producing  no  detection  in  P,  ,  or  1-  na  -  y(x)) .  As  with  cross¬ 

es 


entropy,  the  probability  of  not  detecting  an  object  at  a  negative  position  x,  is  1  -  y(jc,. ) .  If 


we  define  N  as  the  set  of  all  know  negative  locations  then  the  new  error  function 
becomes; 


^UOP  ~ 


(5) 


xePi 


XjeN 


We  call  this  the  Uncertain  Object  Position  (UOP)  error  function.  The  first  term  of 
equation  5  is  the  probability  of  detecting  at  least  one  pixel  in  a  positive  region  while  the 
second  term  is  the  probability  of  no  detection  in  a  negative  region.  The  gradient  of  Euop 
with  respect  to  the  network  weights  is 


dw 


no-)’(j)) 

xePi  ^oy(x)/ow 

Yl(^-y(x))-li^^(l-y(x)) 

xePi 


1  dy(x) 


Si-y(^)  dw 


(6) 


which  is  used  in  an  optimization  loop  for  training."^ 


‘  a  '"weight  decay"  regularization  term,  r  =  — 


,  is  added  to  the  error  functions  to  prevent 


the  networks  from  becoming  "over-trained".  A  was  adjusted  to  minimize  the  cross-validation 
error,  computed  by  dividing  the  training  data  into  disjoint  subsets  whose  union  is  the  entire  set. 
The  network  was  first  trained  on  all  of  the  training  data,  and  then,  starting  from  this  set  of 
weights,  the  network  was  retrained  on  the  data  with  one  of  the  subsets  left  out.  The  resulting 
network  was  tested  on  the  "holdout"  subset.  This  retraining  and  testing  with  a  holdout  set  was 
repeated  for  each  of  the  subsets,  and  the  average  of  the  errors  on  the  subsets  is  the  cross- 
validation  error,  an  unbiased  estimate  of  the  average  error  on  new  data. 
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As  an  illustration  of  the  utility  of  the  UOP  error  function,  we  compare  the  detection 
performance,  with  a  network  trained  using  cross-entropy,  for  a  “toy  problem”  as  shown 
in  figure  4.  A  lO-by-10  grid  of  single  pixel  objects  was  embedded  in  a  noisy  background. 
Single  pixel  objects  were  assigned  a  pixel  value  of  one,  while  background  pixels  had  a 
value  of  one-half  or  zero  randomly  assigned  with  equal  probability.  Errors  were 
introduced  into  the  truth  data  by  randomly  shifting  the  truth  data  within  a  3-by-3  pixel 
neighborhood  centered  around  the  object’s  true  position  (see  figure  5b).  A  “network” 
consisting  of  a  single  sigmoidal  neuron,  with  activation  and  transfer  functions  as  in 
equations  2  and  3,  was  used  to  search  the  image  for  the  objects.  At  a  given  location 
X  =  {jc,  y}the  inputs  to  the  network  are  nine  pixel  values  from  a  3-by-3  window  in  the 

input  image,  centered  on  x . 


Kinsert  figure  4  hero 

In  figure  4,  the  truth  image  shows  both  the  single  point  truth  data  and  the  square  3-by-3 
region  around  these  pixels.  The  images  in  figure  4d  and  e  are  the  outputs  of  the  network 
trained  using  the  cross-entropy  error  function.  The  cross-entropy  trained  network  with  the 
output  in  figure  4d  was  trained  using  single  point  truth  data  while  the  network  with  the 
output  shown  in  figure  4e  was  trained  using  the  3-by-3  region  truth  data.  Figure  4f  is  the 
output  of  the  network  trained  using  the  UOP  error  function  with  positive  regions  P,-  as 
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shown  in  figure  4c.  As  is  evident  from  the  figure,  the  UOP  trained  network  produces 
qualitatively  superior  results. 

We  measured,  quantitatively,  the  detection  performance  of  the  networks  by  computing 
the  sensitivity  and  false  positive  rates  on  the  data.  For  the  cross  entropy  trained  networks 
sensitivity  was  90%  with  a  7.5%  pixel  false  positive  rate.  For  the  UOP  trained  network, 
sensitivity  was  100%  with  a  0%  false  positive  rate. 

III.  Results 

A.  The  Experimental  Paradigm 

We  conducted  a  series  of  experiments  to  determine  the  utility  of  the  HPNN  architecture 
for  mammographic  CAD.  The  goal  of  the  first  set  of  experiments  was  to  validate  our 
hierarchical  network  architecture  and  learning  algorithms  for  capturing  contextual 
information  and  to  demonstrate  improved  detection  performance,  relative  to  traditional 
neural  network  architectures.  The  second  set  of  experiments  focused  on  a  quantitative 
and  rigorous  evaluation  of  the  HPNN,  in  particular  evaluation  of  two  architectures  for 
reducing  the  false  positive  rate  of  the  state-of-the-art  CAD  systems  developed  by  UofC. 
Finally,  as  a  demonstration  of  clinical  utility,  we  integrated  the  HPNN  with  a  UofC  CAD 
system  and  evaluated  its  performance  in  a  Reader  Study. 
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B.  Validation  of  the  network  hierarchy  architecture 

Three  neural  network  architectures  were  evaluated,  each  having  one  hidden  layer  with  4- 
8  hidden  units.^  A  two  level  coarse-to-fine  IFP  was  constructed  and  used  as  input  to  the 
different  network  architectures.  As  shown  in  figure  5,  network  A  consists  of  a  single 
network  processing  data  from  the  coarsest  resolution  of  the  IFP,  network  B  is  a  single 
network  receiving  input  from  all  levels  of  the  IFP  and  network  C  is  a  2  level  coarse-to- 
fine  HPNN.  The  networks  had  activation  and  transfer  functions  described  previously 
(equations  2  &3)  and  were  trained  using  cross-entropy  error  (equation  4). 

We  trained  the  networks  on  five  mammograms.  Each  mammogram  had  one  or  two 
clusters  with  approximately  20  microcalcifications  per  mammogram,  for  a  total  of  97. 

The  results  given  below  were  measured  on  five  test  mammograms  with  one  cluster  each, 
for  a  total  of  95  microcalcifications. 

Kinsert  figure  5  here> 

Results  for  the  three  networks  are  shown  as  receiver  operating  characteristic  (ROC) 
curves  [23]  in  Figure  5.  Note  the  improvement  as  finer  resolution  information  is  added  to 
the  network  (networks  A  vs  B)  and  especially  the  very  large  improvement  when  using  the 
hierarchical  network  architecture  (networks  A&B  vs.  C).  We  considered  whether 
network  C  was  in  fact  taking  advantage  of  context  information  by  examining  the 
representations  developed  by  various  hidden  units  in  the  network.  Figure  6  shows 
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outputs  of  two  classes  of  hidden  units.  The  first  class  (figure  6  B)  appears  to  represent 
point-like  structure,  similar  to  the  structure  of  an  individual  microcalcification.  The 
second  class  of  hidden  unit  (figure  6C)  has  a  different  representation.  In  this  case,  the  unit 
is  selective  for  long,  extended,  and  oriented  structure.  When  shown  to  radiologists,  they 
noted  that  this  hidden  unit  structure  appeared  correlated  with  the  ductal  and  vascular 
anatomy.  As  mentioned  previously,  the  development  of  breast  cancer  is  often  correlated 
with  these  anatomical  structures.  Results  for  this  experiment  suggest  that  the  coarse-to- 
fine  hierarchical  neural  network  is  able  to  automatically  extract  information  that  is 
consistent  with  known  contextual  relationships  and  that  this  may  result  in  the  observed 
improvement  in  detection  performance. 

<insert  figure  6  here> 


C.  Validation  of  UOP  for  microcalcification  detection 

To  validate  the  utility  of  our  UOP  error  function  (equation  5)  for  mammographic  CAD 
we  conducted  experiments  comparing  detection  performance  with  the  cross  entropy  error 
funetion  (equation  4).  We  trained  and  tested  a  single  neuron  network  to  detect 
microcalcifications,  using  the  dataset  described  in  the  previous  experiment.  Expert 
radiologists  constructed  the  truth-data,  however  inspection  of  the  data  indicated 
positional  errors  of  up  to  2  pixels.  At  a  given  location  x ,  the  inputs  to  the  network  were 
the  25  pixel  values  in  a  5x5  window  in  the  input,  centered  on  x .  We  expect  that  the 


'  Model  complexity  was  controlled  for  by  adding/ subtracting  hidden  units  using  a  cross- 
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average  local  brightness  is  not  related  to  the  detection  problem.  Therefore,  to  enforce 
invariance  to  average  local  brightness  we  constrained  the  weights  of  the  single  unit 
network  to  sum  to  one. 

Figure  7  shows  results  for  a  test  mammogram.  Note  that  the  network  trained  using  UOP 
generates  fewer  false  positives  than  the  conventional  cross  entropy  error  function.  If 
thresholds  are  applied  to  the  networks  so  that  50%  of  the  true  positives  are  detected,  the 
UOP  trained  network  has  50%  fewer  false  positives  that  the  cross  entropy  network. 

<insert  figure  7  here> 

D.  Results  on  research  database:  microcalcincation  detection 

Given  results  for  the  previous  two  experiments  we  next  evaluated  the  performance  of  an 
HPNN  architecture  trained  using  the  UOP  error.  In  the  remaining  experiments  described 
in  this  paper  we  evaluated  the  performance  of  the  HPNN  as  a  post  processor  or  adjunct 
for  the  UofC  CAD  system. 

UofC  provided  data  used  for  the  microcalcification  experiments.  The  first  set  of  data 
consists  of  50  true  positive  and  86  false  positive  ROIs.  These  ROIs  are  99-by-99  pixels 
and  digitized  at  lOOpm  resolution.  A  second  set  of  data  from  the  UofC  clinical  testing 
database  included  47  true  positives  and  103  false  positives,  also  99-by-99  and  sampled  at 
lOOpm  resolution. 


validation  error. 
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We  trained  a  coarse-to-fine  HPNN  (figure  2A),  using  UOP  error  function,  to  detect 
individual  microcalcifications.  Training  and  testing  were  done  using  a  jackknife  protocol 
[24],  whereby  one  half  of  the  data  (25  TPs  and  43  FPs)  was  used  for  training  and  the 
other  half  for  testing.  Results  were  compiled  for  five  different  random  splits  of  the  data. 
For  a  given  ROI,  the  probability  map  produced  by  the  network  was  thresholded  at  a  given 
value  to  produce  a  binary  detection  map.  Region  growing  was  used  to  count  the  number 
of  distinct  detected  regions.  The  ROI  was  classified  as  a  positive  if  the  number  of 
regions  was  greater  than  or  equal  to  a  given  cluster  criterion. 

Table  1  compares  ROC  results  for  the  HPNN  and  the  shift-invariant  artificial  neural 
network  (SIANN)  network  that  had  been  used  in  the  UofC  CAD  system  [25].  Reported 
are  the  area  under  the  ROC  curve  (A^),  the  standard  deviation  of  across  the  subsets  of 
the  jackknife  (gaz),  the  false  positive  fraction  at  a  true  positive  fraction  of  1.0  (FPF 
@TPF=1.0)  and  the  standard  deviation  of  the  FPF  across  the  subsets  of  the  jackknife 
(Ofpf)-  Az  and  FPF@TPF=1.0  represent  the  averages  of  the  subsets  of  the  jackknife. 

Note  that  both  networks  operate  best  when  the  cluster  criterion  (cc)  is  set  to  two.  For  this 
case  the  HPNN  has  a  higher  than  the  SIANN  network  while  also  halving  the  false 
positive  rate.  This  difference,  between  the  two  networks'  Az  and  FPF  values,  is 
statistically  significant  (z-test;  paz=  0018,  pFPr=  00001) 


<insert  table  1  here> 
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The  second  set  of  data  was  tested  using  a  coarse-to-fme  HPNN  trained  on  the  first 
dataset.  150  ROIs  taken  from  a  clinical  study  and  classified  as  positive  by  the  full  UofC 
CAD  system  for  microcalcification  detection  (including  the  SIANN  neural  network)  were 
used  to  test  the  HPNN.  Though  the  UofC  CAD  system  classified  all  150  ROIs  as 
positive,  only  47  were  in  fact  positive  while  103  were  negatives — this  dataset  was 
overpopulated  with  false  positives.  We  applied  the  HPNN  trained  on  the  entire  previous 
data  set  to  this  new  set  of  ROIs.  The  HPNN  was  able  to  reclassify  47/103  negatives  as 
negative,  without  loss  in  sensitivity,  i.e.,  no  false  negatives  were  introduced. 

On  examining  the  negative  examples  rejected  by  the  coarse-to-fine  HPNN,  we  found  that 
many  of  these  ROIs  contained  linear,  high-contrast  structure  that  would  otherwise  be 
false  positives  for  the  SIANN  network  (see  figure  8).  One  possible  reason  for  this  is  that 
the  coarse-to-fine  HPNN  also  learns  context  for  the  false  positives.  SIANN  presumably 
interprets  the  “peaks”  on  the  linear  structure  as  calcifications.  However  because  the 
coarse-to-fine  HPNN  also  integrates  information  from  low  resolution  it  can  associate 
these  “peaks”  with  linear  structure  at  low  resolution  and  thus  determine  that  these  peaks 
are  not  microcalcifications.  This  is  an  interesting  difference  from  our  earlier  results,  in 
which  the  networks  appeared  to  learn  contextual  relationships  associated  with  positive 
examples — ductal  and  vascular  anatomy.  Thus  it  appears  that  the  HPNN  can  exploit 
contextual  relationships  to  both  detect  true  positives  and  eliminate  false  positives. 


<insert  figure  8  here> 
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E.  Results  on  a  research  data  base:  mass  detection 

The  next  set  of  experiments  applied  a  fine-to-coarse  HPNN  architecture  to  detect  masses 
in  digitized  mammograms.  Radiologists  often  distinguish  malignant  from  benign  masses 
based  on  the  detailed  shape  of  the  mass  border  and  the  presence  of  spicules  along  the 
border  [16].  We  evaluate  the  fine-to-coarse  HPNN,  figure  2B,  for  its  ability  to  integrate 
high-resolution  information  within  the  context  of  coarse-scale  mass  structure. 

The  experimental  paradigm  is  similar  to  the  microcalcification  experiments  in  that  we 
apply  the  HPNN  as  a  post-processor  to  the  UofC  CAD  system  for  mass  detection.  The 
data  in  our  study  consists  of  72  positive  and  100  negative  ROIs.  The  negative  ROIs  are 
false-positives  of  the  earlier  stages  of  the  CAD  system.  These  are  256-by-256  pixels  and 
are  sampled  at  200|xm  resolution. 

Results  for  the  fine-to-coarse  HPNN  system  are  shown  in  Table  2.  The  Az  value  on  the 
test  set  was  0.85.  These  results  show  a  51%  reduction  in  false  positive  rate  of  the  UofC 
mass  detection  system  without  loss  in  sensitivity. 

<insert  table  2  here> 


F.  Results  in  Clinical  evaluation 

As  a  final  test  of  the  utility  of  the  HPNN  architecture  a  clinical  reader  study  was 
conducted  to  evaluate  the  performance  of  the  combined  HPNN/UoC  system  for 
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microcalcification  detection.^  Details  of  the  reader  study  have  been  described  previously 
[26].  In  this  paper  we  summarize  the  results. 

Table  3  outlines  the  protocol.  Approximately  900  retrospective  mammographic  cases 
were  collected  and  read  by  ten  readers.  Five  readers  were  considered  experts  in 
mammography  (spent  over  50%  of  their  time  reading  mammograms)  and  the  other  five 
were  general  radiologists  who  were  MQSA  certified  [27].  Films  were  read  in  two 
conditions;  film  only  (unaided)  or  film  +  computer  results  (aided). 

Kinsert  table  3  here> 

Results  of  the  computer  output  alone  are  shown  in  Table  4.  Note  that  on  this  new  dataset 
the  HPNN  continues  to  reduce  the  false  positive  rate  of  the  microcalcification  CAD 
system. 


<insert  table  4  here> 

The  clinical  utility  of  the  complete  system,  which  includes  the  CAD  systems  for  mass 
detection  and  the  HPNN  enhanced  system  for  microcalcification  detection,  is  shown  in 
Table  5,  comparing  reader  performance  with  and  without  the  computer  aid.  Expert 
readers  showed  a  statistically  significant  improvement  when  using  the  CAD  system, 
however  the  improvement  was  not  statistically  significant  for  the  general  radiologists. 


‘  In  this  clinical  evaluation  only  the  coarse-to-fine  HPNN  for  microcalcification  was  integrated 
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One  possible  reason  is  that  false  positives  continue  to  be  an  issue,  since  experts  are  better 
than  general  radiologists  at  negating  or  ignoring  these  false  positives.  Additional  analysis 
is  required  to  understand  the  difference  between  the  two  groups.  However  the  overall 
results  show  that  the  CAD  system,  which  included  the  HPNN,  can  potentially  improve 
performance  of  mammographic  screening,  in  this  case  for  more  experienced  radiologists. 

<insert  table  5  here> 


IV.  Discussion 

In  this  paper  we  have  demonstrated  coarse-to-fine  and  fine-to-coarse  HPNN  architectures 
that  learn  contextual  relationships  for  detecting  microcalcifications  and  masses  in 
digital/digitized  mammograms.  Though  the  architectures  are  novel,  they  bear  some 
resemblance  to  previous  network  architectures.  For  example,  the  fine-to-coarse  HPNN  is 
similar  to  the  convolution  network  proposed  by  Le  Cun,  [28],  however  with  a  few  notable 
differences.  The  fine-to-coarse  HPNN  receives  as  inputs  preset  features  extracted  from 
the  image  (in  this  case  radial  and  tangential  gradients)  at  each  resolution,  compared  to  the 
convolution  network,  whose  inputs  are  the  original  pixel  values  at  the  highest  resolution. 
Secondly,  in  the  fine-to-coarse  HPNN,  the  inputs  to  a  hidden  unit  at  a  particular  position 
are  the  pixel  values  at  that  position  in  each  of  the  feature  images,  one  pixel  value  per 
feature  image.  Thus  the  HPNN’s  hidden  units  do  not  learn  linear  filters,  except  as  linear 
combinations  of  the  filters  used  to  form  the  features.  Finally  the  fine-to-coarse  HPNN  is 
also  trained  using  the  UOP  error  function,  which  is  not  used  in  the  convolution  network. 


with  the  UofC  CAD  and  evaluated. 
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The  two  architectures  we  have  described  can  be  combined  into  a  more  general 
architecture  that  integrates  information  both  coarse-to-fine  and  fine-to-coarse.  This  bi¬ 
directional  integration,  shown  in  the  architecture  of  figure  9,  is  attractive  in  that  most 
objects  can  be  considered  to  have  a  “natural  scale”  —  typically  some  measure  of  their 
size.  Classification  of  the  object  might  be  improved  through  integration  of  finer  and 
coarser  resolution  information,  relative  to  this  natural  scale.  Since  size  can  vary  within  a 
class  of  objects,  it  may  be  worthwhile  to  include  outputs  at  more  than  one  level  of  the 
HPNN.  In  this  case,  the  UOP  error  (Equation  5)  needs  to  be  modified  to  include 
uncertainty  over  scale,  but  this  is  easily  accomplished  by  changing  the  product  to  range 
over  positions  at  all  output  levels.  We  can  further  generalize  the  architecture  by  adding 
connections  between  the  fine-to-coarse  and  coarse-to-fine  paths,  but  one  must  be  careful 
to  avoid  loops  when  deciding  where  these  connections  should  be  added.  We  are 
currently  investigating  the  application  of  this  generalized  HPNN  architecture  to  mass 
detection. 


<insert  figure  9  hero 

Most  of  our  results  were  reported  relative  to  the  UofC  CAD  mammographic  systems, 
since  they  are  considered  to  be  well-characterized  and  state-of-the-art.  UofC  is 
continuing  to  improve  upon  their  systems  and  our  current  results  are  only  meant  as  a 
comparison  to  a  given  standard  at  a  given  point  in  time.  An  issue  in  CAD  research  is  the 
need  for  the  development  of  appropriate  benchmarks  for  comparing  different  algorithms. 
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Several  datasets  are  being  developed  which  might  eventually  support  such  comparisons 
though  they  have  yet  to  be  widely  accepted 

V.  Conclusion 

We  have  presented  the  application  of  hierarchical  pyramid  neural  network  architectures 
to  two  problems  in  computer-aided  diagnosis;  the  detection  of  microcalcifications  in 
mammograms  and  the  direct  detection  of  masses  in  mammograms.  In  the  case  of 
microcalcifications,  the  coarse-to-fine  HPNN  architecture  successfully  discovered  large- 
scale  context  information  that  improves  the  system's  performance  in  detecting  small 
objects.  A  coarse-to-fine  HPNN  has  been  directly  integrated  with  the  UofC  CAD  system 
for  microcalcification  detection  and  the  complete  system  has  been  tested  in  clinical  reader 
study.  In  the  case  of  mass  detection,  a  fine-to-coarse  HPNN  architecture  was  used  to 
exploit  information  from  fine  resolution  detail  in  order  to  eliminate  false  positives.  In 
general,  we  have  found  that  the  HPNN  is  a  useful  class  of  network  architecture  for 
exploiting  context  and  integrating  information  at  multiple  scales. 
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Figures 


digital/digitized 

mammogram 


suspicious  locations 


Figure  1  Processing  in  a  CAD  system. 


A  B 

Figure  2:  Hierarchical  pyramid/neural  network  architectures.  (A)  fine-to-coarse  and  (B)  coarse-to-fine.  In  (A)  context  is 
propagated  from  low  to  high  resolution  via  the  hidden  units  of  low-resolution  networks.  In  (B)  small  scale  detail 
Information  Is  propagated  from  high  to  low  resolution.  In  both  cases  the  output  of  the  last  Integration  network  is  an 
estimate  of  the  probability  that  a  target  is  present.  Arrow  shows  direction  of  information  flow. 
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Figure  3  Integrated  Feature  Pyramids  (iFPs)  for  A)  coarse-to-fine  and  B)  fine-to-coarse  HPNN. 
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P 


Figure  4:  ‘Toy  problem”  illustrating  performance  of  UOP  error  function  versus  cross-entropy  error.  (A)  Image 
consisting  of  1 0x1 0  grid  of  white  dots  in  a  background  of  random  binary  noise.  (B)  Single  point  truth  data  with 
positionai  error.  (C)  Truth  data  created  by  considering  the  magnitude  of  the  positional  error  (+/-1  pixel  results  in  3X3 
regions).  (D)  Output  for  network  trained  using  cross-entropy  error  and  truth  data  in  B.  (E)  Output  of  network  trained 
using  cross-entropy  error  and  truth-data  in  C.  (F)  Output  of  network  trained  using  UOP  error  and  truth  data  in  C. 
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Figure  6  (A)  Original  mammogram,  (B)  hidden  unit  representations  for  networks  operating  at  high  resolutions 
(C)  hidden  unit  representations  for  networks  operating  at  low  resolutions.  Radiologists  have  noted  that  some 
of  the  structure  in  C  appears  to  correlate  with  specific  anatomy  in  the  breast  (e.g.  ducts  and/or  blood  vessels), 
indicating  that  these  hidden  units  may  represent  contextual  information. 


A  B  C  D 


Figure  7  Detecting  microcalcifications  using  UOP  error  function.  The  upper  row  contains  reduced 
resolution  images  from  one  full  size  test  mammogram.  The  lower  row  shows  a  region  of  interest 
at  full  resolution.  (A)  image  (B)  truth  data  (C)  output  of  UOP  trained  network  (D)  output  of  cross 
entropy  trained  network. 


Linear  context 

Figure  8  Typical  negative  ROI  that  was  eliminated  by  the  coarse-to-fine  HPNN  for  microcalcification  detection.  The 
HPNN  is  able  to  associate  the  intensity  peaks,  which  in  isolation  may  be  interpreted  as  microcalcfications,  with  the 
coarse-scale  linear  structure  in  order  to  classify  the  ROI  as  a  negative. 
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Figure  9  Generalized  HPNN  architecture.  Integration  is  bi-directional  with  output  networks  at  the  “natural  scale”  of  the 
object.  The  natural  scale  may  be  known  a  priori  or  it  can  be  searched  for  by  optimizing  over  several  output  networks 
(e.g.  search  for  the  best  one  over  the  two  output  networks  shown  above). 
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Tables 


Table  1  Comparison  of  HPNN  and  SIANN  networks 


HPNN 

SIANN 

A, 

^Az 

FPF 

TPF=1.0 

Gfpf 

1 

Az 

^Az 

FPF 

TPF=1.0 

Gfpf 

1 

.93 

.03 

.24 

.11 

.88 

.04 

.50 

.11 

2 

.94 

.02 

.21 

.11 

.91 

.02 

.43 

3 

.94 

.39 

.19 

.91 

.48 

.19 

4 

.93 

.48 

.15 

EH 

.56 

.21 

5 

.51 

.06 

.88 

EaBI 

.68 

.21 

Table  2  Sensitivity  and  specificity  for  fine-to-coarse  HPNN  for  mass  detection 


infffRTRsrn 

100% 

51% 

95% 

57% 

90% 

67% 

80% 

79% 

Table  3:  Summary  of  Reader  Study  protocol. 

899  cases  (4  standard  views,  original  mammograms) 

•  501  normals  (including  10  atypia) 

•  199  benign 

•  199  malignant  (58  DCIS+141  invasive)  (22%) 


two  reading  conditions: 

•  film  only 

•  film  +  computer  results 

•  films  were  mounted  on  alternators 

•  computer  results  were  shown  on  CRT  monitors 
standard  observer  study  protocol 

»  training  session  randomized  reading  order,  etc. 

10  readers: 

•  5  specialists  (>50%  breast  imaging) 

•  5  general  radiologists  (MQSA  certified) _ 
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Table  4:  False  positive  rates  of  CAD  System 


CAD  Program 

Number  false 
positives  per 
image  (at  fixed 
sensitivity) 

Mass  detection 

1.6 

Microcalc  detection  (no  HPNN) 

1.04 

Microcalc  detection  (with  HPNN) 

0.88 

Table  5:  Reader  study  results  using  CAD  system. 


Reader 

Specialists 

General  Radiologists 

Unaided 

Aided 

Unaided 

Aided 

1 

0.851 

0.878 

0.813 

0.824 

2 

0.891 

0.911 

0.862 

0.876 

3 

0.878 

0.898 

0.881 

0.888 

4 

0.911 

0.914 

0.876 

0.863 

5 

0.899 

0.892 

avg 

0.866 

0.869 

p  value 

0.( 
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0.19 
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